HTML in every language

HTML is the one aspect of the Web that most cried out for internationalization, and yet this was perhaps the easiest aspect to internationalize. The original problem, of course, was character encoding: an HTML document is composed of text and HTML tags to structure and enhance the text. The text characters must be coded, and HTML was designed around the ISO 8859-1 character set (also known as ISO Latin-1), which only supports representation of Western European languages. Other languages are left out in the cold!
When HTML was first applied to other languages, it was done by trial
and error. For some alphabets, it is possible simply to substitute
another font for the ISO Latin-1 font and still achieve satisfactory
results. A Cyrillic font can be used to display a document in Russian,
for example, provided the document's encoding matches that of the
font, something that is far from certain since several encodings are
in common use in Russia. In addition, the document's encoding must be
known somehow and the software adjusted accordingly, a chronic problem
on the Web today. But this trick will not work for the many languages
that require more advanced processing, such as contextual analysis,
bi-directional text, specific line-wrapping algorithms, etc. Also a
very bad idea is to use the
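The dependence on knowing the document's encoding can be illustrated with a short Python sketch. KOI8-R and Windows-1251 are used here as two example Russian encodings; the text does not say which codes it has in mind, so that choice is an assumption, as is the sample word.

```python
# Sample Russian word (chosen for illustration only).
text = "Привет"

# Bytes as an author using KOI8-R would store them (assumed encoding).
koi8_bytes = text.encode("koi8_r")

# A client that knows the encoding recovers the text exactly.
right = koi8_bytes.decode("koi8_r")

# A client that assumes Windows-1251 still gets Cyrillic letters,
# but the wrong ones: same bytes, different characters.
wrong = koi8_bytes.decode("cp1251")

print(right == text)   # the correct guess round-trips
print(wrong == text)   # the wrong guess does not
```

Both decodings succeed without raising an error, which is why a mismatch of this kind tends to go unnoticed until someone reads the page.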
To solve this problem once and for all, and to make HTML a markup language worthy of its position as the Web's lingua franca, in 1995 the IETF [1] HTML working group (html-wg) undertook to extend the HTML specification by adopting the ISO 10646 [2] character set as its reference. This character set contains several tens of thousands of characters, enough to represent almost all the world's languages. This effort resulted in the publication of an RFC (an Internet standard). HTML is now adequate for almost all languages; the remaining exceptions will be addressed through future extensions to ISO 10646.

In addition to adopting ISO 10646, the new standard adds a few tags to HTML that are essential or useful for various languages, especially those requiring bi-directional text or contextual analysis. It also includes some rules of "good behavior" for HTTP servers, character encoding and form submission mechanisms, and other internationalization matters that fall somewhat beyond the scope of HTML. All the details can be found in RFC 2070, "HTML Internationalization", which was almost 18 months in the making.
No, HTML Internationalization does not mean that you must write all
your documents in Unicode. In principle, you can encode your documents
in any character set appropriate to the language, but clients
(browsers, etc.) must know which one you have chosen. Also, it is
better to stick with standardized and well-known character sets such
as the ISO 8859 series, important national standards (ISCII, ASMO,
VISCII, etc.), or ISO 10646, of course. To indicate your document's
encoding, a useful tip is to add the following line to your document
header, immediately following the
Of course, you must replace
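The exact line recommended here is missing from this copy of the text. Purely as an illustration, the Python sketch below assumes the common convention of labeling the encoding with a `charset` parameter on the `Content-Type` value, written either as an HTTP header or as a META element in the document header. The helper names and the choice of ISO-8859-5 are hypothetical.

```python
from email.message import Message


def content_type(charset: str) -> str:
    """Build a Content-Type value that labels an HTML document's encoding."""
    return f"text/html; charset={charset}"


def meta_line(charset: str) -> str:
    """Hypothetical helper: the same label written as an HTML META element."""
    return f'<META HTTP-EQUIV="Content-Type" CONTENT="{content_type(charset)}">'


def declared_charset(header_value: str):
    """Read the label back, as a client might, with the stdlib MIME parser."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lowercase, or None when no label is present.
    return msg.get_content_charset()


label = content_type("ISO-8859-5")       # assumed example charset
print(meta_line("ISO-8859-5"))
print(declared_charset(label))
print(declared_charset("text/html"))     # unlabeled: the client must guess
```

The last call shows the situation the article warns about: with no label, the client has nothing to go on.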
Special characters, entities and numerical references

By special characters, we usually mean anything not found in ASCII. In fact, it is rather the opposite: it is ASCII that holds a special place in computing by virtue of its quasi-universality. There is a persistent belief that only ASCII characters (unaccented letters, numbers, a few punctuation marks and other symbols) can be used in an HTML document. This belief would have it that accented letters must be coded either as an entity reference (e.g. &eacute; for e acute) or as a numerical reference (e.g. &#233; for the same e acute). This is not true. These references are legitimate, of course, but they are by no means compulsory; they exist primarily as shortcuts for people who do not have a keyboard capable of entering such characters directly.

In addition to making your HTML source (the marked-up text as it appears in a text editor) difficult to read, references have other drawbacks. For example, if you use a Web search engine to look for documents containing a word with one or more accented letters (or worse, a non-Latin letter), no match will occur if the word was coded using a reference (unless, of course, the search engine takes the trouble to replace the reference with the appropriate character, something the author should have done in the first place).

Furthermore, documents too often contain references to non-existent characters such as &#146;. Prior to HTML internationalization, the reference character set was ISO 8859-1, which does not include any characters between 128 and 159; since then, the reference character set has been ISO 10646, which does not include any printable characters in this range either, since it is reserved for control characters.
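The search problem and the 128 to 159 range can both be checked with a short Python sketch using only the standard library. The sample fragment and the two matching functions are hypothetical; `html.unescape` stands in for whatever reference replacement a search engine might perform.

```python
import html
import unicodedata

# The same word written with an entity reference and a numerical reference.
source_fragment = "caf&eacute; and caf&#233;"

# Replacing the references yields the characters themselves.
decoded = html.unescape(source_fragment)


def naive_match(word: str, page_source: str) -> bool:
    """Search the raw source, as an engine that ignores references would."""
    return word in page_source


def normalized_match(word: str, page_source: str) -> bool:
    """Search after replacing references with the characters they denote."""
    return word in html.unescape(page_source)


print(decoded)
print(naive_match("café", source_fragment))       # reference hides the word
print(normalized_match("café", source_fragment))  # found once references are replaced

# Code points 128..159 are control characters (Unicode category "Cc"),
# so a numerical reference into that range names no printable character.
print(all(unicodedata.category(chr(cp)) == "Cc" for cp in range(128, 160)))
```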
In short, the right way to code a document is as follows: 1) choose a
character set that is appropriate for the document's language and as
standard as possible; 2) use the characters in the set as much as
possible when typing and avoid using entity or numerical references;
3) save the document and send it with an appropriate character
encoding label (MIME

Apart from their obvious use in making up for inadequate keyboards, numerical or entity references (when they exist) have other uses. They can be used to insert characters into a document that are not part of the character set you have chosen. In addition, your software may not support editing in your selected character set (e.g. an unsophisticated Macintosh editor cannot handle ISO 8859-1, the obvious choice for any document in a Western European language); in a case such as this, you must represent anything not in the ASCII set with references (unless you have another program on hand that is capable of converting the finished document to the right encoding).
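The fallback described above, where references are used only for characters missing from the chosen character set, can be sketched in Python with the standard `xmlcharrefreplace` error handler. The function name and the sample string are hypothetical.

```python
import html


def encode_for_html(text: str, charset: str) -> bytes:
    """Encode text in the chosen charset, using numerical references
    only for characters the charset cannot represent."""
    return text.encode(charset, errors="xmlcharrefreplace")


sample = "Dvořák au café"

# ISO 8859-1 has á and é but not ř, so only ř becomes a reference.
latin1 = encode_for_html(sample, "iso-8859-1")

# An ASCII-only tool chain forces a reference for every accented letter.
ascii_only = encode_for_html(sample, "ascii")

print(latin1)
print(ascii_only)

# A client that decodes the bytes and replaces references recovers the text.
print(html.unescape(latin1.decode("iso-8859-1")) == sample)
```

Characters that the charset does contain are stored directly, which keeps the source readable and searchable.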
Finally, the characters used for markup (<, > and &)
must always be coded by reference whenever you wish them to appear
as literal text rather than as part of a tag. For example, to obtain

[1] The Internet Engineering Task Force is the body responsible for creating and maintaining Internet standards.

[2] Also known as Unicode. In fact, Unicode is an industry standard separate from ISO 10646, except that it exactly duplicates the same characters in the same positions. In practice, the two names can usually be used interchangeably with no problem.

The Tango Multilingual Browser will properly display all of Babel's languages.

© 1996, Alis Technologies inc.