Yergeau, Nicol, Adams & Dürst — Internationalization of HTML

2. The document character set

2.1. Reference processing model

This overview explains a reference processing model used for HTML, and in particular the SGML concept of a document character set. An actual implementation may differ widely in its internal workings from the model given below, but should behave as described to an outside observer.

Because there are various widely differing encodings of text, SGML does not directly address how the sequence of characters that constitutes an SGML document in the abstract sense is encoded by means of a sequence of octets (or occasionally bit groups of a length other than 8) in a concrete realization of the document such as a computer file. This encoding is called the external character encoding of the concrete SGML document, and it should be carefully distinguished from the document character set of the abstract HTML document. SGML views the characters as a single set (called a character repertoire), together with a code set that assigns an integer number (known as a character number) to each character in the repertoire. The document character set declaration defines what each of the character numbers represents [GOLD90, p. 451]. In most cases, an SGML DTD and all documents that refer to it have a single document character set, and all markup and data characters are part of this set.
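
The distinction between a character number and its external encoding can be illustrated with a minimal Python sketch. It assumes that Python's str type is an acceptable stand-in for a sequence of UCS characters; the helper name octets_for and the choice of encodings are illustrative only and are not taken from this specification.

```python
def octets_for(char: str, encodings: list[str]) -> dict[str, bytes]:
    """Return the external octet sequence of one character under each encoding."""
    return {name: char.encode(name) for name in encodings}


char = "é"
number = ord(char)  # the character number: one value, independent of any encoding

# The same abstract character has a different concrete realization per encoding.
octets = octets_for(char, ["iso-8859-1", "utf-8", "utf-16-be"])

for name, data in octets.items():
    print(f"{name:>10}: character number {number}, octets {data.hex()}")
```

The character number stays the same in every row, while the octets differ. Only the character number is visible to SGML.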

HTML, as an application of SGML, does not directly address the question of the external character encoding. This is deferred to mechanisms external to HTML, such as MIME as used by the HTTP protocol or by electronic mail.

For the HTTP protocol [RFC2068], the external character encoding is defined by the charset parameter of the Content-Type field of the header of an HTTP response. For example, to indicate that the transmitted document is encoded in the JUNET encoding of Japanese [RFC1468], the header will contain the following line:

  Content-Type: text/html; charset=ISO-2022-JP

The term charset in MIME is used to designate a character encoding, rather than merely a coded character set as the term may suggest. A character encoding is a mapping (possibly many-to-one) of sequences of octets to sequences of characters taken from one or more character repertoires.

The HTTP protocol also defines a mechanism for the client to specify the character encodings it can accept. Clients and servers are strongly requested to use these mechanisms to assure correct transmission and interpretation of any document. Provisions that can be taken to help correct interpretation, even in cases where a server or client does not yet use these mechanisms, are described in section 6.
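
As a hedged illustration of the client-side mechanism, the sketch below builds an Accept-Charset request header, which is the header HTTP/1.1 [RFC2068] provides for this purpose. The function name accept_charset and the example preference list are assumptions made for this sketch.

```python
def accept_charset(preferences: list[tuple[str, float]]) -> str:
    """Build an Accept-Charset header line from (charset, quality) pairs.

    A quality of 1.0 is the default and is therefore left implicit.
    """
    parts = []
    for charset, quality in preferences:
        if not 0.0 <= quality <= 1.0:
            raise ValueError(f"quality out of range for {charset}: {quality}")
        if quality == 1.0:
            parts.append(charset)
        else:
            # HTTP allows at most three decimal digits in a quality value.
            parts.append(f"{charset};q={round(quality, 3):g}")
    return "Accept-Charset: " + ", ".join(parts)


header = accept_charset([("iso-2022-jp", 1.0), ("utf-8", 0.8)])
print(header)
```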

Similarly, if HTML documents are transferred by electronic mail, the external character encoding is defined by the charset parameter of the Content-Type MIME header line [RFC2045], and defaults to US-ASCII in its absence.
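
The charset lookup described above for HTTP and for mail can be sketched as follows. This is a minimal example using the header parser from Python's standard email package. The function names external_charset and decode_body are hypothetical, and the US-ASCII default reflects the MIME rule for mail only.

```python
from email.message import Message


def external_charset(content_type: str, default: str = "us-ascii") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the parameter is absent (US-ASCII is the
    MIME default for mail; other protocols may define a different one).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


def decode_body(octets: bytes, content_type: str) -> str:
    """Decode a transmitted body according to its declared external encoding."""
    return octets.decode(external_charset(content_type))


declared = external_charset("text/html; charset=ISO-2022-JP")
fallback = external_charset("text/html")
```

Here declared holds the charset name in lower case and fallback holds the default. A real receiver would also need a policy for charset names it does not recognize, which this sketch leaves out.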

No mechanisms are currently standardized for indicating the external character encoding of HTML documents transferred by FTP or accessed in distributed file systems.

Should any other way of transferring or storing HTML documents be defined or become popular, it is advised that similar provisions be made to clearly identify the character encoding used and/or to use a single/default encoding capable of representing the widest range of characters used in an international context.

Whatever the external character encoding may be, the reference processing model translates it to the document character set specified in Section 2.2 before processing specific to SGML/HTML. The reference processing model can be depicted as follows:

  [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                         [manager]  [parser]
                             ^          |
                             |          |
                             +----------+

The decoder is responsible for decoding the external representation of the resource to the document character set. The entity manager, the parser, and the application deal only with characters of the document character set. A display-oriented part of the application or the display machinery itself may again convert characters represented in the document character set to some other representation more suitable for their purpose. In any case, the entity manager, the parser, and the application, as far as character semantics are concerned, are using the HTML document character set only.

An actual implementation may choose, or not, to translate the document into some encoding of the document character set as described above; the behaviour described by this reference processing model can be achieved otherwise. This subject is well outside the scope of this specification, however, and the reader is invited to consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88] [GOLD90] [VANH90] [SQ91] for further information.

The most important consequence of this reference processing model is that numeric character references are always resolved with respect to the fixed document character set, and thus to the same characters, whatever the external encoding actually used. For an example, see Section 2.2.
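
This consequence can be sketched in Python. The example is a simplified stand-in for the decoder and parser stages, not an SGML parser. It handles only decimal numeric character references via a regular expression, and all names in it are assumptions made for illustration.

```python
import re

# Decimal numeric character references only, e.g. "&#1048;".
NCR = re.compile(r"&#([0-9]+);")


def decode_resource(octets: bytes, charset: str) -> str:
    """Decoder stage: external encoding -> document character set."""
    return octets.decode(charset)


def resolve_references(text: str) -> str:
    """Parser stage (simplified): resolve references as character numbers.

    chr() accepts values up to 0x10FFFF only, so larger character numbers
    raise ValueError in this sketch.
    """
    return NCR.sub(lambda m: chr(int(m.group(1))), text)


def process(octets: bytes, charset: str) -> str:
    return resolve_references(decode_resource(octets, charset))


source = "И = &#1048;"
as_koi8 = source.encode("koi8-r")      # one external encoding
as_iso = source.encode("iso8859-5")    # another external encoding

result_koi8 = process(as_koi8, "koi8-r")
result_iso = process(as_iso, "iso8859-5")
```

The two octet sequences differ, yet both results are identical. The reference is resolved after decoding, against the document character set, and never against the external encoding.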

2.2. The document character set

The document character set, in the SGML sense, is the Universal Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. Currently, this is code-by-code identical with the Unicode standard, version 1.1 [UNICODE].

NOTE: implementers should be aware that ISO 10646 is amended from time to time; four amendments have been adopted since the initial 1993 publication, none of which significantly affects this specification. A fifth amendment, now under consideration, will introduce incompatible changes to the standard: 6656 Korean Hangul syllables allocated between code positions 3400 and 4DFF (hexadecimal) will be moved to new positions (and 4516 new syllables added), thus making references to the old positions invalid. Since the Unicode consortium has already adopted a corresponding amendment for inclusion in the forthcoming Unicode 2.0, adoption of DAM 5 is considered likely and implementers should probably consider the old code positions as already invalid. Despite this one-time change, the relevant standards bodies appear to remain committed not to change any allocated code position in the future. To encode Korean Hangul irrespective of these changes, the combining Hangul Jamo in the range 1110-11F9 can be used.

The adoption of this document character set implies a change in the SGML declaration specified in the HTML 2.0 specification (section 9.5 of [RFC1866]). The change amounts to removing the first BASESET specification and its accompanying DESCSET declaration, replacing them with the following declaration:

  BASESET "ISO Registration Number 177//CHARSET
           ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
           //ESC 2/5 2/15 4/6"
  DESCSET  0   9     UNUSED
           9   2     9
           11  2     UNUSED
           13  1     13
           14  18    UNUSED
           32  95    32
           127 1     UNUSED
           128 32    UNUSED
           160 2147483486 160

Making the UCS the document character set does not create non-conformance of any expression, construct or document that is conforming to HTML 2.0. It does, however, make certain constructs conforming that are not admissible in HTML 2.0. One consequence is that data characters outside the repertoire of ISO-8859-1, but within that of UCS-4, become valid SGML characters. Another is that the upper limit of the range of numeric character references is extended from 255 to 2147483645; thus, &#1048; is a valid reference to a CYRILLIC CAPITAL LETTER I (И). [ERCS] is a good source of information on Unicode and SGML, although its scope and technical content differ greatly from this specification.

NOTE: the above SGML declaration, like that of HTML 2.0, specifies the character numbers 128 to 159 (80 to 9F hex) as UNUSED. This means that numeric character references within that range (e.g. &#146;) are illegal in HTML. Neither ISO 8859-1 nor ISO 10646 contains characters in that range, which is reserved for control characters.

Another change was made from the HTML 2.0 SGML declaration, in the belief that the latter did not express its authors' true intent. The syntax character set declaration was changed from ISO 646.IRV:1983 to the newer ISO 646.IRV:1991, the latter, but not the former, being identical with US-ASCII. In principle, this introduces an incompatibility with HTML 2.0, but in practice it should increase interoperability by i) having the SGML declaration say what everyone thinks and ii) making the syntax character set a proper subset of the document character set. The characters that differ between the two versions of ISO 646.IRV are not actually used to express HTML syntax.

ISO 10646-1:1993 is the most encompassing character set currently existing, and there is no other character set that could take its place as the document character set for HTML. If nevertheless for a specific application there is a need to use characters outside this standard, this should be done by avoiding any conflicts with present or future versions of ISO 10646, i.e. by assigning these characters to a private zone of the UCS-4 coding space [ISO-10646 section 11]. Also, it should be borne in mind that such a use will be highly unportable; in many cases, it may be better to use inline bitmaps.

2.3. Undisplayable characters

With the document character set being the full ISO 10646, the possibility that a character cannot be displayed due to lack of appropriate resources (fonts) cannot be avoided. Because there are many different things that can be done in such a case, this document does not prescribe any specific behaviour. Depending on the implementation, this may also be handled by the underlying display system and not the application itself. The following considerations, however, may be of help:
