This FAQ is about using XML for Chinese-language documents. For general English-language XML or SGML FAQs see:
John Lamp and Dave Megginson's comp.text.sgml FAQ at http://lamp.man.deakin.edu.au/sgml/sgmlfaq.txt;
Peter Flynn's XML FAQ at http://www.ucc.ie/xml/ (this is also available in Japanese, Korean and Spanish: does anyone want to translate it into Chinese?).
An XHTML Chinese version of this FAQ is also available in the following character encodings: Big 5, UTF-8, and GB2312.
A.1. What is XML?
XML (eXtensible Markup Language) is a simple language for marking up structures in text documents. It is based on an International Standard, Standard Generalized Markup Language (SGML), which the International Organization for Standardization (ISO) published as ISO 8879:1986. It looks like HTML. You can create and use your own tags and document structures with it. You can also use it for serializing data from databases.
A.2. Who developed it?
ISO SGML originally came out of IBM, but has had a lot of input from different companies. XML is developed by the World Wide Web Consortium (W3C); the director of W3C is Tim Berners-Lee, who invented the WWW. The XML project leader was Jon Bosak, chief Information Architect at Sun Microsystems. The XML specification was co-written by representatives of Netscape, Microsoft, and a large academic project called the Text Encoding Initiative (TEI). The W3C Special Interest Group (W3C XML SIG) had representatives from over 100 different companies and invited experts.
A.3. Which companies are using it or supporting it?
Microsoft, Netscape, Sun, IBM, Corel, Adobe, Oracle, RealAudio...almost everyone.
A.4. How do I know if I am using XML?
Most XML is hidden; you write your own specialist markup language using it. There are many specialist markup languages built using XML now. For example,
RealAudio's new RealPlayer uses W3C SMIL (Synchronized Multimedia Integration Language);
Netscape's "What's Related" uses W3C RDF (Resource Description Framework);
Microsoft's Internet Explorer "Channels" uses CDF (Channel Definition Format).
A.5. Can I use Chinese Data?
Yes. All conforming XML processors must support ISO 10646 characters. ISO 10646 is a big character set that is an ISO standard. It includes the characters from Big5 and GB2312.
However, XML is just starting. So most XML software has not been tested on Chinese data yet (December, 1999).
A.6. Can I use Chinese element names?
Yes. All conforming XML processors must allow you to use Han ideographs for element names. But you can only use the characters defined in the character set you are using; you cannot use numeric character references in names.
There is a particular problem with Big5. See Question B.12.
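As a minimal sketch of the two rules above, the following uses Python's standard xml.etree.ElementTree parser. The element and attribute names (書, 題, 語言) and the helper is_well_formed are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Element and attribute names written directly in Han ideographs.
# The names are invented for this sketch.
doc = '<書 語言="中文"><題>XML 入門</題></書>'
root = ET.fromstring(doc)
print(root.tag, root.attrib["語言"], root.find("題").text)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A numeric character reference is not allowed inside a name,
# so this document is rejected by the parser.
print(is_well_formed("<&#x66F8;>XML</&#x66F8;>"))
```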
A.7. What is XHTML?
XHTML is the name given to HTML, when it uses XML syntax. XML syntax is stricter than old HTML: you cannot leave out most end-tags, it is case-sensitive, and you must have quotes on attribute values.
A.8. What is Well-formed and Valid?
An XML document is well-formed if its syntax is correct: all the required delimiters are correct, and end-tags correspond to their start-tags. All XML processors are required to inform the user if a document is not well-formed; processing will probably halt. (HTML browsers forgive syntax errors; XML processors do not.)
An XML document is valid if its element structure conforms to the "content model" of the optional "markup declarations" in the "prolog" of the document. There are special tools called validators which can help you.
A content model is like an "assert" statement in C or C++ programming languages: it lets you check that the XML document does have the structure you expect or require it to have. For example, some software may expect that the element "HTML:img" has an attribute "src". If you can make assertions about the structure of the document, there are fewer cases you need to program for.
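The difference between a syntax error and a failed structural assertion can be sketched in Python. This is only an illustration: the standard xml.etree.ElementTree parser checks well-formedness but does not validate against a DTD, so a plain assert stands in for the content model here. The function name check_images is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def check_images(xml_text: str) -> list:
    """Parse xml_text and return the src value of every img element.

    Raises ET.ParseError if the document is not well-formed, and
    AssertionError if an img element has no src attribute.
    """
    root = ET.fromstring(xml_text)  # well-formedness is checked here
    sources = []
    for img in root.iter("img"):
        # Hand-written stand-in for what a content model would declare.
        assert "src" in img.attrib, "img element without src attribute"
        sources.append(img.attrib["src"])
    return sources


print(check_images('<doc><img src="logo.png"/></doc>'))
```

A document with a missing end-tag never reaches the assertion: the parser stops with ParseError first, which matches the rule that processing halts on a well-formedness error.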
B.1. Can I use Big5?
Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which other character set encodings to support.
Outside the XML processor, the document is encoded in Big5.
Inside the XML system, the document is encoded as ISO10646. (ISO10646 is the character set that Java uses. It is the same character set as Unicode.)
If you use Big5, every file ("parseable entity") used in your XML document must start with the following header: <?xml version="1.0" encoding="Big5"?>
Why is it important to set the character set correctly? XML will be used for commercial information. If your XML document is not labelled with the correct character set information, an XML processor will reject it. XML moves away from character set guessing (i.e., what HTML does) to explicit markup of character sets.
There is a particular problem with Big5. See Question B.12.
B.2. Can I use GB2312?
Maybe. A conforming XML processor must support UTF-8 and UTF-16 (Unicode). But XML can be encoded using almost any character set. It is the parser-writer's decision which encodings to support.
Outside the XML processor, the document is encoded in GB2312 (e.g. the EUC encoding of GB2312 and ASCII, also called cn-euc);
Inside the XML system, the document is encoded as ISO10646. (ISO10646 is the character set that Java uses. It is the same character set as Unicode.)
If you use GB2312, every file ("parseable entity") used in your XML document must start with the following header: <?xml version="1.0" encoding="GB2312"?>
B.3. How do I know what character sets the XML software supports?
All XML software must support UTF-8 (and therefore ASCII) and UTF-16 (Unicode). Much XML software will also support other character encodings. In any case, you can always convert ("transcode") your XML document into UTF-8 and use any XML software.
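A minimal sketch of the transcoding step, in Python. It assumes that you already know the source encoding, that the document has no byte order mark, and that replacing the whole XML declaration is acceptable (any standalone setting is dropped). The function name transcode_to_utf8 is made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of the document.
XML_DECL = re.compile(r"^<\?xml[^>]*\?>")


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw from source_encoding and re-encode it as UTF-8,
    rewriting the encoding header so that it matches the new bytes."""
    text = raw.decode(source_encoding)
    text = XML_DECL.sub('<?xml version="1.0" encoding="UTF-8"?>', text, count=1)
    return text.encode("utf-8")


big5_doc = '<?xml version="1.0" encoding="Big5"?><doc>中文</doc>'.encode("big5")
utf8_doc = transcode_to_utf8(big5_doc, "big5")
print(ET.fromstring(utf8_doc).text)
```

The header has to change together with the bytes: a UTF-8 file that still says encoding="Big5" is labelled with the wrong character set.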
B.4. What if I am using Big5 and the character I need is not there?
You can use any character defined by the ISO 10646 universal character set. If Big5 (or GB2312) does not have that character, you can use a "numeric character reference". This looks like &#x5ABC; where the "5ABC" is a hexadecimal number giving the ISO10646 number. Windows NT's "Character Map" utility lets you see all the ISO 10646 characters in any font.
If the character is a variant on an ISO10646 character, you can make up an element type (and attribute) so that it will display properly. For example, you could borrow the "SPAN" element type from HTML and then use a stylesheet (e.g. Cascading Stylesheets: CSS) to select the font you need for that character.
If the character is not in ISO10646, then you have to use the private-use character area. This is not an improvement on what you have to do now! We hope some better system will be included in XML and ISO10646 in the future.
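The numeric character reference form described above can be produced and checked with a few lines of Python. This is only a sketch; the helper name ncr is made up here.

```python
import xml.etree.ElementTree as ET


def ncr(char: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return "&#x{:X};".format(ord(char))


reference = ncr("\u5ABC")  # the code point used in the text above
element = ET.fromstring("<p>" + reference + "</p>")

# The parser turns the reference back into the character itself.
print(reference, element.text == "\u5ABC")
```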
B.5. What about Web systems that "Transcode"?
Some Japanese Web servers, proxies or browsers automatically convert between Japanese character sets (e.g., between Shift-JIS and EUC-J). This also apparently occurs in some other languages (e.g., Russian). As far as we know, this "transcoding" does not happen automatically for Chinese Web servers (e.g., between Big5 and GB2312; between the "traditional" characters and the simplified characters).
Big5 to GB2312 conversion is not perfect: some characters are missing. It is possible to create an XML-aware "lossless" transcoder: this is a transcoder that will convert unavailable characters into Numeric Character References (NCRs). (We have made some in Academia Sinica Computing Centre, for example.)
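The idea of a lossless transcoder can be sketched with Python's built-in codecs. This is not the Academia Sinica tool, only an illustration under simplifying assumptions: it uses the standard "xmlcharrefreplace" error handler, which writes decimal references, it does not update the encoding header, and it is safe only for character data, because references are not allowed in element or attribute names. The function name is made up.

```python
def big5_to_gb2312_lossless(big5_bytes: bytes) -> bytes:
    """Re-encode Big5 bytes as GB2312, replacing every character that
    GB2312 lacks with a numeric character reference."""
    text = big5_bytes.decode("big5")
    return text.encode("gb2312", errors="xmlcharrefreplace")


# 中 exists in both character sets; the traditional form 許 is not in GB2312.
source = "中許".encode("big5")
print(big5_to_gb2312_lossless(source))
```

Note that this keeps the traditional character as a reference; it does not map it to a simplified character.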
In order to prevent transcoding, you can send the document using the MIME type application/xml. That is supposed to prevent the document being transcoded. If you are using Apache, then the following may be useful to look at:
AddType application/xml XML xml
or
ForceType application/xml
or
DefaultType application/xml
However, please note that for HTML, the popular Web browsers use lots of tricks to guess which character set encoding is used. This will probably continue even with HTML-in-XML (also known as "Voyager"). So using application/xml will prevent proxies from transcoding, but the receiving system may have "lossy" transcoding built-in anyway! (See question B.13.)
B.6. What if my Web Server sends the wrong charset?
Many (most) web servers do not send the correct charset in the HTTP/MIME headers. In fact, many Web Servers do not allow you to specify the character set at all!
Here are some guidelines:
in the future, Web Servers will look at the XML encoding header (but not yet; XML is new);
if your site only serves one encoding, make sure that your webserver sends that as the default;
if your webserver supports HTTP 1.1 content negotiation (e.g. Apache) and you have many different languages, the server will have some system for selecting files using language (e.g. using filenames like file.xml.en and file.xml.cn); or
use a different directory for each file, and use the .htaccess control file to set the language. (If you are using Apache, your Webmaster must give you "AllowOverride FileInfo" permission). See next question.
B.7. How do I send the correct MIME/HTTP headers using Apache .htaccess files?
If you are using a recent version of the Apache server, then your Webmaster must give you "AllowOverride FileInfo" permission. Then you can put a file called .htaccess in any directory. (Note that in MIME terminology, "encoding" means "compression". In the XML encoding header, "encoding" means "coded character set".) Here are some lines that may be useful:
DefaultLanguage zh
AddType application/xml XML xml
or
AddType "text/xml; charset=Big5" XML xml
Why is it important to set the language correctly? Because it can help with searching later.
Why do you want to send XML with the MIME type application/xml rather than text/xml? Because then the file will not be transcoded. Transcoding can make characters disappear when going from Big5 to GB2312. (See also question B.6)
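To see what a receiving program gets from the two header styles above, here is a small Python sketch that parses a Content-Type value with the standard email.message module. The helper name declared_charset is made up for this illustration.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return (MIME type, charset) from a Content-Type header value.
    The charset is None when the header does not carry one."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_content_charset()


print(declared_charset("text/xml; charset=Big5"))
print(declared_charset("application/xml"))
```

When the header carries no charset, the receiver has only the XML encoding header inside the document to go on, which is one more reason to write that header correctly.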
B.8. What is the standard attribute xml:lang for?
Every XML element can have an attribute called xml:lang. It lets you set the language you are using. You can use this to help searching and typesetting. Put this attribute on the top-level element in your Chinese XML document. The values you can use for Chinese include:
xml:lang="zh" for any Chinese text;
xml:lang="zh-TW" for Chinese text from Taiwan (i.e., traditional characters);
xml:lang="zh-HK" for Chinese text from Hong Kong (i.e., probably traditional characters);
xml:lang="zh-CN" for Chinese text from China (i.e., simplified characters);
xml:lang="zh-SG" for Chinese text from Singapore.
Of course, this attribute is very simple, but it is important to label all your documents with what language they use. Then a Chinese Web-Robot can automatically add your text to a WWW index, and a Western Web-Robot will know that it should not add the information. Or an automatic translation service can be invoked. Some words are used differently in different Chinese locales (e.g. Taiwan and China) so it can help with automated translation and searching too.
B.9. Can I mix different kinds of Chinese in the same document?
Yes. Every XML element can take an attribute called "xml:lang" which says which language the element is. This is not the character encoding (e.g. Big5 or GB2312), but the language: for example
<p xml:lang="zh-TW">...</p>
means that the element is in Chinese, as used in Taiwan. By implication, the element p should use traditional characters.
<p xml:lang="zh-HK">...</p>
means that the element is in Hong Kong Chinese.
<p xml:lang="en-SG">...</p>
means that the element is in Singapore English.
An element p whose xml:lang value carries the code for Cantonese, and which contains a subelement z with xml:lang="en", means that the element p is in Cantonese Chinese ("YUH" is the code for Cantonese in the SIL Ethnologue: http://www.sil.org/ethnologue/countries/Chin.html ) but the subelement z is in English. (Some characters are used phonetically; these kinds of characters are dialect-specific and unreadable outside the dialect.) By implication, the element p should use simplified characters.
Of course, you can also invent your own attributes to do anything you like:
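For example, an element marked as Hong Kong Chinese that should use simplified characters can also carry the attribute specification traditional="OK". The element then contains data in Hong Kong Chinese, and it should use simplified characters. But the attribute specification 'traditional="OK"' is your attribute: you can use it to say that it is OK to also use the traditional glyph (image).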
In XML, you use markup to describe all the interesting information about the data. Then you write a program or stylesheet or report generator to implement what you need, using the markup.
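As a sketch of a program that uses this markup, the following Python reports the language in force for each element of a mixed document. It relies on the XML rule that an xml:lang value applies to an element's content until a descendant overrides it. The sample document and the function name languages are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<doc xml:lang="zh-TW"><p>...</p>'
    '<p xml:lang="zh-HK">...<z xml:lang="en">...</z></p></doc>'
)
for tag, lang in languages(doc):
    print(tag, lang)
```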
B.10. I heard that Unicode is not a good character set for Chinese!
The Unicode Consortium are a group of companies (including the Japanese company Justsystem, and companies with large Japanese joint operations, like (Fuji-)Xerox) that decided to make a big character set which had all the world's characters. They took the ISO character set ISO 10646 and have added other information: standard names and characteristics. Unicode includes all the characters from GB2312 and (probably) all the characters from Big5. Plus it includes many other characters. (ISO 10646 has several encodings: UTF-8 is 8-bit and UTF-16 is 16 bit. Unicode is a form of UTF-16.)
So Unicode is better than Big5 and GB2312: it has more characters.
But, there are problems with the ISO 10646 encodings:
The 16-bit fixed-length encoding (UTF-16 or Unicode) takes up no more space than Big5 or GB2312. But the 8-bit variable-length encoding uses 3 bytes per Chinese character. This means that an XML file may be 50% larger using UTF-8 than using Big5. But this number will be less if ASCII markup is used (e.g. if the DTD comes from the West). Markup can be up to 50% of a document's text. And, in any case, the best way to keep file sizes down is by compression....perhaps.
ISO 10646 does not use the same order as any Chinese character set...you cannot use a simple algorithm to convert from Big5 and GB2312 into ISO10646. You must use a big table. But, on the other hand, ISO 10646 puts the Chinese characters into an order that may be more useful for sorting. And it removes duplicated characters, so searching may be better too. (I have been told that GBK is a character set which has all the ISO 10646 characters but keeps GB2312 characters in their same codepoints. That may be a good character set in some cases.)
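The size comparison above is easy to check for yourself. The following Python sketch counts the bytes that a short string needs in each encoding; the function name encoded_sizes and the sample strings are made up for this illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in each encoding."""
    encodings = ("big5", "utf-16-le", "utf-8")
    return {name: len(text.encode(name)) for name in encodings}


# Two Han characters on their own, then the same text inside ASCII markup.
print(encoded_sizes("中文"))
print(encoded_sizes("<p>中文</p>"))
```

For the bare characters, UTF-8 needs 6 bytes where Big5 and UTF-16 need 4, which is the 50% figure. Once ASCII markup is added the gap between UTF-8 and Big5 narrows, while UTF-16 grows because it spends two bytes on every ASCII character.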
So XML files do not have to be encoded in UTF-8 or UTF-16. You can use Big5 or GB2312. But not much XML software supports the Chinese character sets. So it is good advice to move to UTF-8 or UTF-16 in the long run.
B.11. Why does software xxx work with Big5: the documentation says it does not?
Big5 is an "8-bit unsafe" "ASCII-family" coded character set.
"ASCII-family" coded character sets (ASCII, ISO646, ISO8859-*, UTF-8, EUC, Big5, GB2312) are all the sets which have the ASCII characters in the ASCII codepoints (e.g., where "A" has the codepoint 65 (0x41)). All ASCII characters have a value less than decimal 128 (0x80).
An "8-bit safe" character encoding is one in which, if a byte appears which has a value less than 128 (0x80), then that byte always means the ASCII character. Shift-JIS and Big5 are not 8-bit safe, because the second byte of a multiple-byte character code can have a value less than 128 (0x80). The advantage of 8-bit safe encodings is that they are compatible with software which only looks at the ASCII characters for markup recognition.
A "7-bit safe" character encoding is one in which, if a byte appears which has a value less than 64 (0x40), then that byte always means the ASCII character. Shift-JIS and Big5 are not 8-bit safe (because the second byte of a multiple-byte character code can have a value less than 128 (0x80)) but they are 7-bit safe (the second byte is always greater than 63 (0x3F)). 7-bit safe encodings are compatible with software which only looks at the ASCII characters less than 64 (0x40) for delimiter recognition. In XML, the delimiters < > & % " ' all have values less than 64 (0x40).
This means that there is a lot of XML software which will work with XML documents encoded in Big5. But it is an accident, because, strictly, if an XML-system does not understand the encoding given in the XML encoding header, the XML processor should signal an error. In particular, such systems will probably not handle numeric character references (NCR) correctly (See question B.4). But they may be useful anyway, of course, even if they are non-conforming.
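The two definitions can be tested on real bytes. The Python sketch below collects every byte below 0x80 that appears inside the encoding of a non-ASCII character; the function name low_trailing_bytes is made up for this illustration, and 也 is used because its Big5 code is A4 5D.

```python
def low_trailing_bytes(text: str, encoding: str) -> list:
    """Return the bytes below 0x80 that occur inside the encoded form
    of the non-ASCII characters of text."""
    found = []
    for char in text:
        if ord(char) < 0x80:
            continue  # a real ASCII character, not part of a multi-byte code
        found.extend(b for b in char.encode(encoding) if b < 0x80)
    return found


# Big5 puts an ASCII-range byte inside this character; EUC (GB2312) and UTF-8 do not.
print([hex(b) for b in low_trailing_bytes("也", "big5")])
print(low_trailing_bytes("也", "gb2312"), low_trailing_bytes("也", "utf-8"))
```

The Big5 byte that shows up is 0x5D, which is below 0x80 (so Big5 is not 8-bit safe) but not below 0x40 (so it does not break the 7-bit safe rule).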
There is a particular problem with Big5. See Question B.12.
B.12. Some Big5 documents fail with strange errors? Why?
The second byte of Big5 characters can cause problems on some systems. Big5 is not "8-bit safe" (see Question B.11.)
The problems will show up only on systems which do not convert the Big5-encoded documents into an 8-bit safe internal format (e.g., Unicode, or UTF-8 or UTF-16.) On these systems, some bytes of the Big5 code will be interpreted as the wrong characters.
The first problem occurs when you are using Native Language Markup (e.g., you are using Han ideographs for element names, attribute names, ID attributes, etc.). There is no way to fix this problem. If you must use that kind of software, then you must avoid using in markup any Big5 character whose second byte is not a valid name character.
The second problem occurs in the very rare case that you are using one of the following Han ideographs in a CDATA section, and that character is followed by the string "]>". The second byte of the character and the "]>" then read together as "]]>", the delimiter that ends a CDATA section. To fix this problem, you can split the CDATA section into two CDATA sections, and sandwich the naughty character between them. The following characters all have the byte 5D as their second byte (in Big5): this is ASCII "]".
兡也包因沘氓侷柵苗孫孫財 崧淫設弼琶跑愍窟榜蒸奭稽 霄瓢館縲擻鼕孃魔釁佉沎岠 狋垚柛胅娭涘罞偟惈牻荺傒 焱菏酡廅滘絺赩塴榗箂踃嬁 澕蓴醊獧螗餟燱螬駸礑鎞瀧 鄿瀯騬醹躕鱕
(Note: If you cannot see all the characters, then see question B.13.)
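The CDATA problem and its workaround can be sketched in Python. The example uses 也, whose Big5 code is A4 5D. Both function names are made up for this illustration, and the check only looks at raw bytes, the way software that does not decode Big5 would.

```python
import xml.etree.ElementTree as ET

CDATA_END = b"]]>"


def has_false_cdata_end(text: str, encoding: str = "big5") -> bool:
    """True if the encoded bytes contain "]]>" although the characters do not."""
    return CDATA_END in text.encode(encoding) and "]]>" not in text


def split_cdata(before: str, risky_char: str, after: str) -> str:
    """Place risky_char between two CDATA sections instead of inside one."""
    return "<![CDATA[" + before + "]]>" + risky_char + "<![CDATA[" + after + "]]>"


# Character data as it would sit inside a single CDATA section.
print(has_false_cdata_end("也]>"))

# The workaround: the character sits between two CDATA sections.
fixed = "<doc>" + split_cdata("a[", "也", "]>") + "</doc>"
print(ET.fromstring(fixed).text)
```

A parser that decodes the characters properly sees the same text either way; the split only matters for byte-oriented software.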
B.13. I cannot see all the characters on my HTML browser! Why?
If you cannot see all the characters, then
your browser does not treat Numeric Character References correctly (according to HTML 4 or XML rules);
you do not have the correct font installed or selected; or
your browser uses the "encoding" to determine which font to use, and the font it has selected does not have the characters.
Try changing the "Encoding" menu item (e.g. to Big5 or UTF-8): it is under a different menu on different browsers.
B.14. What is Big5/GCCS, EUDC, and Big5plus?
EUDC (Extended User-defined Characters) is the general name used in Hong Kong for standard sets of user-defined characters (sometimes called by the Japanese term gaiji). They include R&D EUDC, HKUST EUDC and GCCS EUDC.
Big5 was developed in Taiwan. The "traditional" characters are also used in Hong Kong. But Hong Kong also uses other characters which are rarer in Taiwan. The Hong Kong Government has made the Government Chinese Character Set (GCCS), which is Big5 plus an additional 3049 characters. It seems to be in widespread use.
Taiwan standards committees have also recently added an extra 7000 or so characters to Big5, calling it Big5Plus. We cannot be very sure how much it is being used.
Big5/GCCS, EUDC and Big5plus do not have registered IANA encoding names for the Internet. So be careful.
For future interoperability, it is important that all WWW software gives the correct headers in the HTML and XML files. How can you trust your e-commerce data if you cannot know the character set? If you use these new versions of Big5, always put an extra comment or processing instruction at the head of the document to document it. For XML, we suggest that you put, as the second tag in the document, an XML processing instruction with target "ascc-hint" and an attribute "non-IANA".
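The processing instruction has the form <?ascc-hint non-IANA="..."?>, where the attribute value names the version of Big5 that the document uses.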
C.1. What is the best free XML software for Chinese at the moment?
For Chinese XML, the best browser at the moment (April 1999) is probably Internet Explorer 5.0. The best XML parser is probably IBM's XML Parser for Java. The best XML/SGML parser is probably James Clark's SP software (C++). An XML version of Perl is coming too!
For a listing of current XML tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xml.html#xmlSoftware. (1999-04-13)
C.2. What are the best free XML validators for Chinese at the moment?
There are many XML tools which tell you whether an XML entity is well-formed. There are fewer tools which also check document validity against a DTD: Microsoft has a useful tool available at http://www.microsoft.com/xml/ (under "XML & XSL Demos" and "XML Validator").(1999-04-13)
For a listing of current validator tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xml.html#xmlValResources.
C.3. What is the best free XSL software for Chinese at the moment?
All XSL tools are experimental betas, at the moment. The tools from James Clark (XT) and from IBM (LotusXSL) are probably the best. (1999-04-13)
For a listing of current XSL tools, see Robin Cover's web pages at http://www.oasis-open.org/cover/xsl.html#xslSoftware.
C.4. What is the best free XHTML software for Chinese at the moment?
All XHTML tools are experimental betas, at the moment. Dave Raggett's tidy program at http://www.w3.org/People/Raggett/ can help convert HTML to XHTML. (1999-04-13)
Where can I get more Information?
Try the Chinese XML Now! page at Academia Sinica, Taipei.
What is the "Chinese XML Now!" project?
This is a small project from the Computing Center at Academia Sinica, Taipei. It aims at providing information to developers of Chinese XML Software. It is sometimes difficult for non-Chinese-reading software developers to find useful information on the WWW; and when the project began, there was not much Chinese information on XML either.
The project tries to support material equally in English and Chinese, and in UTF-8, Big5 and GB2312.
Whom can I contact about this FAQ?
We welcome corrections, questions and ideas. The contact for the English language material is Rick Jelliffe: ricko@gate.sinica.edu.tw. The contact for Chinese language is Chin-Tang Chang: ctchang@gate.sinica.edu.tw.
Thanks for corrections to Sidney Lu, John Cowan (plus apologies for the previous typo in his name), and Toshinori Numata
The Chinese XML FAQ (English version)