This overview explains a reference processing model used for HTML, and in particular the SGML concept of a document character set. An actual implementation may differ widely in its internal workings from the model given below, but should behave as described to an outside observer.
Because there are various widely differing encodings of text, SGML
does not directly address how the sequence of characters that
constitutes an SGML document in the abstract sense is encoded
by means of a sequence of octets (or occasionally bit groups of
another length than 8) in a concrete realization of the document
such as a computer file. This encoding is called the external
character encoding of the concrete SGML document, and it should
be carefully distinguished from the document character set of
the abstract HTML document. SGML views the characters as a single
set (called a "character repertoire"), and a "code set" that
assigns an integer number (known as "character number") to each
character in the repertoire. The document character set
declaration defines what each of the character numbers represents
[GOLD90, p. 451]. In most cases, an SGML DTD and all documents
that refer to it have a single document character set, and all
markup and data characters are part of this set.
HTML, as an application of SGML, does not directly address the question of the external character encoding. This is deferred to mechanisms external to HTML, such as MIME as used by the HTTP protocol or by electronic mail.
For the HTTP protocol [RFC2068], the external character encoding is
defined by the "charset" parameter of the "Content-Type" field of the
header of an HTTP response. For example, to indicate that the
transmitted document is encoded in the "JUNET" encoding of Japanese
[RFC1468], the header will contain the following line:
    Content-Type: text/html; charset=ISO-2022-JP
The term "charset" in MIME is used to designate a character encoding, rather than merely a coded character set as the term may suggest. A character encoding is a mapping (possibly many-to-one) of sequences of octets to sequences of characters taken from one or more character repertoires.
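As an illustration only, the "charset" parameter can be pulled out of a Content-Type value with Python's standard `email.message.Message` class. The helper name `charset_from_content_type` and its `default` argument are assumptions made for this sketch and are not part of any specification.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lowercased.

    `default` is returned when the header carries no charset parameter.
    Hypothetical helper, written for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter exists.
    charset = msg.get_content_charset()
    return charset if charset is not None else default


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=ISO-2022-JP"))
    print(charset_from_content_type("text/html", default="us-ascii"))
```

The `default` argument leaves room for a transport-specific fallback when the parameter is absent.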
The HTTP protocol also defines a mechanism for the client to specify the character encodings it can accept. Clients and servers are strongly requested to use these mechanisms to assure correct transmission and interpretation of any document. Provisions that can be taken to help correct interpretation, even in cases where a server or client does not yet use these mechanisms, are described in section 6.
Similarly, if HTML documents are transferred by electronic mail, the
external character encoding is defined by the "charset" parameter of
the "Content-Type" MIME header line [RFC2045], and defaults to
US-ASCII in its absence.
No mechanisms are currently standardized for indicating the external character encoding of HTML documents transferred by FTP or accessed in distributed file systems.
If any other way of transferring and storing HTML documents is defined or becomes popular, it is advised that similar provisions be made to clearly identify the character encoding used and/or to use a single/default encoding capable of representing the widest range of characters used in an international context.
Whatever the external character encoding may be, the reference processing model translates it to the document character set specified in Section 2.2 before processing specific to SGML/HTML. The reference processing model can be depicted as follows:
    [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                           [manager]  [parser]
                               ^          |
                               |          |
                               +----------+
The decoder is responsible for decoding the external representation
of the resource to the document character set. The entity manager,
the parser, and the application deal only with
characters of the document character set. A display-oriented part of
the application or the display machinery itself may again convert
characters represented in the document character set to some other
representation more suitable for their purpose. In any case, the
entity manager, the parser, and the application, as far as character
semantics are concerned, are using the HTML document character set
only.
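The decoder stage can be sketched in a few lines of Python, under the assumption that Python's built-in codecs stand in for the decoder and that a list of integer code points stands in for characters of the document character set. The function name `decode_resource` is hypothetical.

```python
def decode_resource(octets, charset):
    """Decode externally encoded octets into document character numbers.

    Python's str type holds Unicode code points, so ord() yields the
    character number of each decoded character. Illustrative sketch only.
    """
    text = octets.decode(charset)
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e"  # three Japanese characters
    for charset in ("iso-2022-jp", "utf-8"):
        octets = sample.encode(charset)
        # The octets differ per encoding; the character numbers do not.
        print(charset, octets, decode_resource(octets, charset))
```

Everything downstream of this step sees the same character numbers whichever external encoding the resource arrived in.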
An actual implementation may choose, or not, to translate the document into some encoding of the document character set as described above; the behaviour described by this reference processing model can be achieved otherwise. This subject is well out of the scope of this specification, however, and the reader is invited to consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88] [GOLD90] [VANH90] [SQ91] for further information.
The most important consequence of this reference processing model is
that numeric character references are always resolved with respect to
the fixed document character set, and thus to the same characters,
whatever the external encoding actually used. For an example, see
Section 2.2.
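A minimal sketch of this consequence, assuming Python's codecs act as the decoder and a simple regular expression stands in for the parser's handling of decimal numeric character references. The names `resolve_numeric_refs` and `process` are hypothetical, and the sketch performs no range or validity checks.

```python
import re

# Matches decimal numeric character references such as &#1048;
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def resolve_numeric_refs(text):
    """Replace each decimal reference by the character with that number."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


def process(octets, charset):
    """Decode first, then resolve references against the document set."""
    return resolve_numeric_refs(octets.decode(charset))


if __name__ == "__main__":
    source = "\u0418 &#1048;"  # a literal Cyrillic letter and a reference
    for charset in ("koi8-r", "iso-8859-5", "utf-8"):
        print(charset, repr(process(source.encode(charset), charset)))
```

In each of the three encodings the literal letter is stored as different octets, yet the reference resolves to the same character because it is interpreted only after decoding.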
2.2. The document character set
The document character set, in the SGML sense, is the Universal Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. Currently, this is code-by-code identical with the Unicode standard, version 1.1 [UNICODE].
NOTE -- implementers should be aware that ISO 10646 is amended from time to time; 4 amendments have been adopted since the initial 1993 publication, none of which significantly affects this specification. A fifth amendment, now under consideration, will introduce incompatible changes to the standard: 6556 Korean Hangul syllables allocated between code positions 3400 and 4DFF (hexadecimal) will be moved to new positions (and 4516 new syllables added), thus making references to the old positions invalid. Since the Unicode consortium has already adopted a corresponding amendment for inclusion in the forthcoming Unicode 2.0, adoption of DAM 5 is considered likely and implementers should probably consider the old code positions as already invalid. Despite this one-time change, the relevant standards bodies appear to remain committed not to change any allocated code position in the future. To encode Korean Hangul irrespective of these changes, the combining Hangul Jamo in the range 1110-11F9 can be used.
The adoption of this document character set implies a change in the SGML declaration specified in the HTML 2.0 specification (section 9.5 of [RFC1866]). The change amounts to removing the first BASESET specification and its accompanying DESCSET declaration, replacing them with the following declaration:
    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
              //ESC 2/5 2/15 4/6"
    DESCSET  0   9          UNUSED
             9   2          9
             11  2          UNUSED
             13  1          13
             14  18         UNUSED
             32  95         32
             127 1          UNUSED
             128 32         UNUSED
             160 2147483486 160
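Each DESCSET row gives a starting character number, a count, and either a base set number or UNUSED. The following Python sketch transcribes the rows above into a table and looks a character number up in it; the names `DESCSET` and `base_number`, and the use of `None` for UNUSED, are conventions assumed for this illustration.

```python
# (first character number, number of characters, first base set number)
# None stands for UNUSED. Rows are transcribed from the declaration above.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]


def base_number(char_number):
    """Return the base set number for char_number, or None if unused."""
    for start, count, base in DESCSET:
        if start <= char_number < start + count:
            if base is None:
                return None
            return base + (char_number - start)
    return None  # outside every described range


if __name__ == "__main__":
    for n in (9, 10, 13, 65, 146, 1048):
        print(n, base_number(n))
```

Read this way, the last row covers character numbers 160 through 2147483645, and a number such as 146 falls in an UNUSED range.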
Making the UCS the document character set does not create
non-conformance of any expression, construct or document that is
conforming to HTML 2.0. It does, however, make conforming certain
constructs that are not admissible in HTML 2.0. One consequence is
that data characters outside the repertoire of ISO-8859-1, but within
that of UCS-4, become valid SGML characters. Another is that the upper
limit of the range of numeric character references is extended from
255 to 2147483645; thus, &#1048; is a valid reference to a
"CYRILLIC CAPITAL LETTER I" (И). [ERCS] is a good source of
information on Unicode and SGML, although its scope and technical
content differ greatly from this specification.
NOTE -- the above SGML declaration, like that of HTML 2.0, specifies the character numbers 128 to 159 (80 to 9F hex) as UNUSED. This means that numeric character references within that range (e.g. &#146;) are illegal in HTML. Neither ISO 8859-1 nor ISO 10646 contains characters in that range, which is reserved for control characters.
Another change was made from the HTML 2.0 SGML declaration, in the belief that the latter did not express its authors' true intent. The syntax character set declaration was changed from ISO 646.IRV:1983 to the newer ISO 646.IRV:1991, the latter, but not the former, being identical with US-ASCII. In principle, this introduces an incompatibility with HTML 2.0, but in practice it should increase interoperability by i) having the SGML declaration say what everyone thinks and ii) making the syntax character set a proper subset of the document character set. The characters that differ between the two versions of ISO 646.IRV are not actually used to express HTML syntax.
ISO 10646-1:1993 is the most encompassing character set currently existing, and there is no other character set that could take its place as the document character set for HTML. If nevertheless for a specific application there is a need to use characters outside this standard, this should be done by avoiding any conflicts with present or future versions of ISO 10646, i.e. by assigning these characters to a private zone of the UCS-4 coding space [ISO-10646 section 11]. Also, it should be borne in mind that such a use will be highly unportable; in many cases, it may be better to use inline bitmaps.
2.3. Undisplayable characters
With the document character set being the full ISO 10646, the possibility that a character cannot be displayed due to lack of appropriate resources (fonts) cannot be avoided. Because there are many different things that can be done in such a case, this document does not prescribe any specific behaviour. Depending on the implementation, this may also be handled by the underlying display system and not the application itself. The following considerations, however, may be of help: