Yergeau, Nicol, Adams & Dürst — Internationalization of HTML

4. Additional entities, attributes and elements

4.1. Full Latin-1 entity set

Following the suggestion in section 14 of [RFC1866], the set of Latin-1 entities is extended to cover the whole right part of ISO-8859-1 (all code positions with the high-order bit set), including the already commonly used &nbsp;, &copy; (©) and &reg; (®). The names of the entities are taken from the appendices of SGML [ISO-8879]. A list is provided in section 7.3 of this specification.
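As a rough check of what "the whole right part of ISO-8859-1" means, the sketch below looks up an entity name for every code position from 160 to 255. It relies on Python's `html.entities.codepoint2name` table, which reflects a later HTML entity list and is only assumed here to agree with the list in section 7.3; the helper name `latin1_entity_names` is hypothetical.

```python
from html.entities import codepoint2name

# Upper half of ISO-8859-1: every code position with the high-order bit set.
LATIN1_UPPER = range(0xA0, 0x100)

def latin1_entity_names():
    """Map each upper-half Latin-1 code point to its entity name (None if unknown)."""
    return {cp: codepoint2name.get(cp) for cp in LATIN1_UPPER}

names = latin1_entity_names()
# Code points for which the (assumed) table has no entity name.
missing = [cp for cp, name in names.items() if name is None]
```

With this table, `names[0xA9]` is `"copy"` and `missing` is empty, i.e. all 96 positions have a name.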

4.2. Markup for language-dependent presentation
4.2.1. Overview

For the correct presentation of text in certain languages (irrespective of formatting issues), some support in the form of additional entities and elements is needed.

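In particular, the following features are dealt with:

- Markup of bidirectional text, i.e. text in which left-to-right and right-to-left scripts are mixed.
- Control of cursive joining behaviour in writing systems that require it.
- Language-dependent rendering of short (in-line) quotations.
- Better justification control for languages where this is important.
- Superscripts and subscripts for languages where they appear as part of general text.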

Some of the above features need very little additional support; others need more. The additional features are introduced below with brief comments only. Explanations on cursive joining behaviour and bidirectional text follow later. For cursive joining behaviour and bidirectional text, this document follows [UNICODE] in that: i) character semantics, where applicable, are identical to [UNICODE], and ii) where functionality is moved to HTML as a higher level protocol, this is done in a way that allows straightforward conversion to the lower-level mechanisms defined in [UNICODE].

4.2.2. List of entities, elements, and attributes

First, a generic container is needed to carry the LANG and DIR (see below) attributes in cases where no other element is appropriate; the SPAN element is introduced for that purpose.

A set of named character entities is added for use with bidirectional rendering and cursive joining control:

<!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
<!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
<!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
<!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->
These entities can be used in place of the corresponding formatting characters whenever convenient, for example to ease keyboard entry or when a formatting character is not available in the character encoding of the document.
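The following minimal sketch shows how a processor might replace the four entity references with the formatting characters they stand for. The code points are copied from the declarations above; the function name `expand_formatting_entities` is hypothetical, and a real parser would handle all entities, not only these four.

```python
import re

# Code points taken from the entity declarations above (8204..8207).
FORMATTING_ENTITIES = {"zwnj": 0x200C, "zwj": 0x200D, "lrm": 0x200E, "rlm": 0x200F}

_ENTITY = re.compile(r"&(zwnj|zwj|lrm|rlm);")

def expand_formatting_entities(text):
    """Replace &zwnj; &zwj; &lrm; &rlm; with the corresponding characters."""
    return _ENTITY.sub(lambda m: chr(FORMATTING_ENTITIES[m.group(1)]), text)
```

Other references, such as `&amp;`, are left untouched by this sketch.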

Next, an attribute called DIR is introduced, restricted to the values LTR (left-to-right) and RTL (right-to-left) and admitted by most elements, for the indication of directionality in the context of bidirectional text (see 4.2.4 below for details). Since any text and many other elements (e.g. tables) can logically be assigned a directionality, all elements except BR, HR, BASE, NEXTID, and META admit this attribute. The DTD reflects this. It is also intended that any new element introduced in later versions of HTML will admit the DIR attribute, unless there is a good reason not to do so.

A new phrase-level element called BDO (BIDI Override) is introduced, which requires the DIR attribute to specify whether the override is left-to-right or right-to-left. This element is required for bidirectional text control; for detailed explanations, see section 4.2.4.

The phrase-level element Q is introduced to allow language-dependent rendering of short quotations depending on language and platform capability. As the following examples show, it is in particular the quotation marks surrounding the quotation that are affected: "a quotation in English", `another, slightly better one', ,,a quotation in German'', « a quotation in French ». The contents of the Q element do not include quotation marks; these have to be added by the rendering process.

NOTE — Q elements can be nested. Many languages use different quotation styles for outer and inner quotations, and this should be respected by user-agents implementing this element.
NOTE — minimal support for the Q element is to surround the contents with some kind of quotes, like the plain ASCII double quotes. As this is rather easy to implement, and as the lack of any visible quotes may affect the perceived meaning of the text, user-agent implementors are strongly requested to provide at least this minimal level of support.
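One possible way for a user agent to add the quotation marks is sketched below. The outer marks follow the examples given above; the inner (nested) marks, the table `QUOTE_STYLES` and the function `render_q` are assumptions made for illustration only. Unknown languages fall back to plain ASCII double quotes, which corresponds to the minimal level of support.

```python
# Hypothetical style table: (outer open, outer close, inner open, inner close).
# Outer marks follow the examples in the text; inner marks are assumptions.
QUOTE_STYLES = {
    "en": ('"', '"', "`", "'"),
    "de": (",,", "''", ",", "'"),
    "fr": ("« ", " »", '"', '"'),
}
FALLBACK = ('"', '"', '"', '"')  # minimal support: plain ASCII double quotes

def render_q(content, lang=None, depth=0):
    """Render a Q element; content is a list of strings and nested lists (nested Q)."""
    outer_open, outer_close, inner_open, inner_close = QUOTE_STYLES.get(lang, FALLBACK)
    # Alternate between outer and inner marks at each nesting level.
    if depth % 2 == 0:
        open_mark, close_mark = outer_open, outer_close
    else:
        open_mark, close_mark = inner_open, inner_close
    body = "".join(
        part if isinstance(part, str) else render_q(part, lang, depth + 1)
        for part in content
    )
    return open_mark + body + close_mark
```

For example, `render_q(["a quotation in French"], "fr")` yields `« a quotation in French »`.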

Many languages require superscript text for proper rendering: as an example, the French Mlle Dupont should have lle in superscript. The SUP element, and its sibling SUB for subscript text, are introduced to allow proper markup of such text. SUP and SUB contents are restricted to PCDATA to avoid nesting problems.

Finally, in many languages text justification is much more important than it is in Western languages, and justifies markup. The ALIGN attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is added to a selection of elements where it makes sense (the block-like P, HR, H1 to H6, OL, UL, DIR, MENU, LI, BLOCKQUOTE and ADDRESS). If a user-agent chooses to have LEFT as a default for blocks of left-to-right directionality, it should use RIGHT for blocks of right-to-left directionality.

NOTE — RFC 1866 section 4.2.2 specifies that an HTML user agent should treat an end of line as a word space, except in preformatted text. This should be interpreted in the context of the script being processed, as the way words are separated in writing is script-dependent. For some scripts (e.g. Latin), a word space is just a space, but in other scripts (e.g. Thai) it is a zero-width word separator, whereas in yet other scripts (e.g. Japanese) it is nothing at all, i.e. totally ignored.
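The script-dependent treatment of line ends described in this note could be sketched as follows. The table `LINE_END_SEPARATOR` and the function `join_lines` are hypothetical, and U+200B ZERO WIDTH SPACE is used here as an assumed stand-in for the "zero-width word separator".

```python
# Assumed separators per script; U+200B stands in for the zero-width word separator.
LINE_END_SEPARATOR = {"Latin": " ", "Thai": "\u200b", "Japanese": ""}

def join_lines(lines, script):
    """Join source lines, turning each end of line into the script's word separator."""
    return LINE_END_SEPARATOR[script].join(line.strip() for line in lines)
```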
NOTE — the SOFT HYPHEN character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &shy;. Its semantics are different from those of the plain HYPHEN: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not displayed at all. In operations like searching and sorting, it must always be ignored.
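The two behaviours required of the SOFT HYPHEN can be illustrated with a small sketch. The function names are hypothetical, and the line breaker makes simplifying assumptions: it performs at most one break and counts one column per character.

```python
SOFT_HYPHEN = "\u00ad"

def search_key(text):
    """Form of the text used for searching and sorting: soft hyphens are ignored."""
    return text.replace(SOFT_HYPHEN, "")

def break_at_soft_hyphen(word, width):
    """Break word once, at the last soft hyphen whose first part fits in width columns."""
    if len(search_key(word)) <= width:
        return [search_key(word)]          # no break needed, nothing displayed
    parts = word.split(SOFT_HYPHEN)
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i])
        if len(head) + 1 <= width:         # +1 for the hyphen shown at the break
            return [head + "-", "".join(parts[i:])]
    return [search_key(word)]              # no usable break point
```

With soft hyphens at `inter|nation|al|iza|tion` and a width of 12, this returns `["internation-", "alization"]`.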
In the DTD, the LANG and DIR attributes are grouped together in a parameter entity called attrs. To parallel RFC 1942 [RFC1942], the ID and CLASS attributes are also included in attrs. The ID and CLASS attributes are required for use with style sheets, and RFC 1942 defines them as follows:

ID
Used to define a document-wide identifier. This can be used for naming positions within documents as the destination of a hypertext link. It may also be used by style sheets for rendering an element in a unique style. An ID attribute value is an SGML NAME token. NAME tokens are formed by an initial letter followed by letters, digits, "-" and "." characters. The letters are restricted to A-Z and a-z.
CLASS
A space separated list of SGML NAME tokens. CLASS names specify that the element belongs to the corresponding named classes. It allows authors to distinguish different roles played by the same tag. The classes may be used by style sheets to provide different renderings as appropriate to these roles.
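A validator for these two attributes might look like the sketch below, which follows the NAME token rule quoted above (an initial letter A-Z or a-z, then letters, digits, "-" and "."). The function names are hypothetical, and any length limit imposed by the SGML declaration is not checked.

```python
import re

# Initial letter, then letters, digits, "-" and "." (letters restricted to A-Z, a-z).
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")

def is_name_token(value):
    """True if value is a NAME token as described for the ID attribute."""
    return _NAME.fullmatch(value) is not None

def parse_class(value):
    """Split a CLASS value into its NAME tokens, rejecting malformed names."""
    names = value.split()
    bad = [name for name in names if not is_name_token(name)]
    if bad:
        raise ValueError(f"not NAME tokens: {bad}")
    return names
```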
4.2.3. Cursive joining behaviour

Markup is needed in some cases to force cursive joining behaviour in contexts in which it would not normally occur, or to block it when it would normally occur.

The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to control cursive joining behaviour. For example, ARABIC LETTER HEH is used in isolation to abbreviate Hijri (the Islamic calendrical system); however, the initial form of the letter is desired, because the isolated form of HEH looks like the digit five as employed in Arabic script. This is obtained by following the HEH with a zero-width joiner whose only effect is to provide context. In Persian texts, there are cases where a letter that normally would join a subsequent letter in a cursive connection does not. Here a zero-width non-joiner is used.
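The Hijri abbreviation described above can be written down as a short character sequence. The sketch below builds it and serializes it using the named entities; the helper `as_html` is hypothetical and simply emits numeric references for any other non-ASCII character.

```python
import unicodedata

HEH = "\u0647"    # ARABIC LETTER HEH
ZWJ = "\u200d"    # zero-width joiner: forces a joining form
ZWNJ = "\u200c"   # zero-width non-joiner: blocks a join

# HEH followed by a joiner, so the initial form is shown rather than the isolated one.
hijri_abbreviation = HEH + ZWJ

ENTITY = {ZWJ: "&zwj;", ZWNJ: "&zwnj;"}

def as_html(text):
    """Serialize text with joiners as named entities and other non-ASCII as &#N;."""
    return "".join(
        ENTITY.get(ch, f"&#{ord(ch)};" if ord(ch) > 127 else ch) for ch in text
    )
```

Here `as_html(hijri_abbreviation)` gives `&#1607;&zwj;`. The Persian case would use `ZWNJ` after the letter whose connection is to be suppressed.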

4.2.4. Bidirectional text

Many languages are written in horizontal lines from left to right, while others are written from right to left. When both writing directions are present, one talks of bidirectional text (BIDI for short). BIDI text requires markup in special circumstances where ambiguities as to the directionality of some characters have to be resolved. This markup affects the ability to render BIDI text in a semantically legible fashion. That is, without this special BIDI markup, cases arise which would prevent any rendering whatsoever that reflected the basic meaning of the text. Plain text may contain BIDI markup in the form of special-purpose formatting characters. This is also possible in HTML, which includes the five BIDI-related formatting characters (202A - 202E) of ISO 10646. As an alternative, HTML provides equivalent SGML markup.

BIDI is a complex issue, and conversion of logical text sequences to display sequences has to be done according to the algorithm and character properties specified in [UNICODE]. Here, explanations are given only as far as they are needed to understand the necessity of the features introduced and to define their exact semantics.

The Unicode BIDI algorithm is based on the individual characters of a text being stored in logical order, that is, the order in which they are normally input and in which the corresponding sounds are normally spoken. To make rendering of logical order text possible, the algorithm assigns a directionality property to each character; for example, Latin letters are specified to have a left-to-right direction, while Arabic and Hebrew characters have a right-to-left direction.
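The per-character directionality property can be inspected with Python's `unicodedata` module, as in the sketch below. The class names it returns ("L", "R", "AL", "ON", ...) come from the Unicode character database shipped with Python, not from this document; the helper `bidi_classes` is hypothetical.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin A, Hebrew ALEF, Arabic BEH, and a double quote (a neutral character).
sample = "A\u05d0\u0628\""
classes = bidi_classes(sample)
```

For this sample the result is `["L", "R", "AL", "ON"]`: two strong right-to-left classes, one strong left-to-right class, and one neutral.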

The left-to-right and right-to-left marks (&lrm; and &rlm;) are used to disambiguate directionality of neutral characters. For example, when a double quote sits between an Arabic and a Latin letter, its direction is ambiguous; if a directional mark is added on one side such that the quotation mark is surrounded by characters of only one directionality, the ambiguity is removed. These characters are like zero width spaces which have a directional property (but no word/line break property).
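The effect of a directional mark on a neutral character can be illustrated with a deliberately simplified rule: a neutral takes the direction of its nearest strong neighbours when they agree, and the base direction otherwise. This is only an approximation of the behaviour defined in [UNICODE], and the function names are hypothetical.

```python
import unicodedata

# Strong bidirectional classes mapped to a direction; everything else is ignored here.
STRONG = {"L": "L", "R": "R", "AL": "R"}

def nearest_strong(text, index, step):
    """Direction of the nearest strong character before (step=-1) or after (step=+1)."""
    i = index + step
    while 0 <= i < len(text):
        direction = STRONG.get(unicodedata.bidirectional(text[i]))
        if direction:
            return direction
        i += step
    return None

def resolve_neutral(text, index, base="L"):
    """Simplified rule: neighbours that agree win, otherwise the base direction."""
    before = nearest_strong(text, index, -1)
    after = nearest_strong(text, index, +1)
    return before if before is not None and before == after else base

ambiguous = "\u0628\"A"          # Arabic BEH, double quote, Latin A
marked = "\u0628\"\u200fA"       # same, with a right-to-left mark after the quote
```

In `ambiguous` the quote's direction depends on the base direction; in `marked` it resolves to right-to-left regardless of the base.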

Nested embeddings of contra-directional text runs, due to nested quotations or to the pasting of text from one BIDI context to another, are also a case where the implicit directionality of characters is not sufficient, requiring markup. Also, it is frequently desirable to specify the basic directionality of a block of text. For these purposes, the DIR attribute is used.

On block-type elements, the DIR attribute indicates the base directionality of the text in the block; if omitted it is inherited from the parent element. The default directionality of the overall HTML document is left-to-right.

On inline elements, it makes the element start a new embedding level (to be explained below); if omitted the inline element does not start a new embedding level.

NOTE — the PRE, XMP and LISTING elements admit the DIR attribute. Their contents should not be considered as preformatted with respect to bidirectional layout, but the BIDI algorithm should be applied to each line of text.
Following is an example of a case where embedding is needed, showing its effect.

Given the following Latin (upper case) and Arabic (lower case) letters in backing store with the specified embeddings:

    <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw </SPAN> EF </SPAN>

one gets the following rendering (with [] showing the directional transitions):

    [ AB [ wz [ CD ] yx ] EF ]

On the other hand, without this markup and with a base direction of LTR, one gets the following rendering:

    [ AB [ yx ] CD [ wz ] EF ]

Notice that yx is on the left and wz on the right, unlike the above case where the embedding levels are used. Without the embedding markup one has at most two levels: a base directional level and a single counterflow directional level.
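The two renderings in this example can be reproduced with a toy model, sketched below. It is not the Unicode BIDI algorithm: it assumes, as the example does, that upper-case runs are left-to-right and lower-case runs are right-to-left, and it represents the markup as nested `(direction, items)` tuples. All names are hypothetical.

```python
def run_direction(run):
    """Toy convention from the example: lower case = RTL, upper case = LTR."""
    return "RTL" if run.islower() else "LTR"

def render(node):
    """Render a (direction, items) node, bracketing every directional transition."""
    direction, items = node
    pieces = []
    for item in items:
        if isinstance(item, tuple):
            pieces.append(render(item))        # explicit nested embedding
        else:
            d = run_direction(item)
            visual = item[::-1] if d == "RTL" else item
            # A run against the enclosing direction forms its own counterflow level.
            pieces.append(visual if d == direction else f"[ {visual} ]")
    if direction == "RTL":
        pieces.reverse()                       # RTL levels are laid out right to left
    return "[ " + " ".join(pieces) + " ]"

MARKED = ("LTR", ["AB", ("RTL", ["xy", ("LTR", ["CD"]), "zw"]), "EF"])
FLAT = ("LTR", ["AB", "xy", "CD", "zw", "EF"])
```

`render(MARKED)` returns `[ AB [ wz [ CD ] yx ] EF ]` and `render(FLAT)` returns `[ AB [ yx ] CD [ wz ] EF ]`, matching the two renderings of the example.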

The DIR attribute on inline elements is equivalent to the formatting characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT EMBEDDING (202B) of ISO 10646. The end tag of the element is equivalent to the POP DIRECTIONAL FORMATTING (202C) character.

Directional override, as provided by the BDO element, is needed to deal with unusual short pieces of text in which directionality cannot be resolved from context in an unambiguous fashion. For example, it can be used to force left-to-right (or right-to-left) display of part numbers composed of Latin letters, digits and Hebrew letters.

The effect of BDO is to force the directionality of all characters within it to the value of DIR, irrespective of their intrinsic directional properties. It is equivalent to using the LEFT-TO-RIGHT OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO 10646, the end tag again being equivalent to the POP DIRECTIONAL FORMATTING (202C) character.
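The equivalences stated for DIR on inline elements and for BDO suggest a direct conversion from markup to formatting characters. The sketch below handles only SPAN and BDO tags with an unquoted DIR attribute, which is an assumption made to keep it short; the function name is hypothetical. An element without DIR starts no embedding, so its end tag produces no character either.

```python
import re

EMBED = {"LTR": "\u202a", "RTL": "\u202b"}     # LEFT-TO-RIGHT / RIGHT-TO-LEFT EMBEDDING
OVERRIDE = {"LTR": "\u202d", "RTL": "\u202e"}  # LEFT-TO-RIGHT / RIGHT-TO-LEFT OVERRIDE
POP = "\u202c"                                 # POP DIRECTIONAL FORMATTING

_TAG = re.compile(r"<(/?)(SPAN|BDO)(?:\s+DIR=(LTR|RTL))?\s*>", re.IGNORECASE)

def to_formatting_characters(markup):
    """Replace SPAN/BDO directional markup with ISO 10646 formatting characters."""
    out, stack, pos = [], [], 0   # stack: did the matching start tag emit a character?
    for m in _TAG.finditer(markup):
        out.append(markup[pos:m.start()])
        pos = m.end()
        closing, name, direction = m.group(1), m.group(2).upper(), m.group(3)
        if closing:
            if stack and stack.pop():
                out.append(POP)
        elif direction is None:
            stack.append(False)               # no DIR: no new embedding level
        else:
            table = OVERRIDE if name == "BDO" else EMBED
            out.append(table[direction.upper()])
            stack.append(True)
    out.append(markup[pos:])
    return "".join(out)
```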

NOTE — authors and authoring software writers should be aware that conflicts can arise if the DIR attribute is used on inline elements (including BDO) concurrently with the corresponding ISO 10646 formatting characters. Preferably one or the other should be used exclusively; the markup method is better able to guarantee document structural integrity, and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be better suited to using the 10646 characters. If both methods are used, great care should be exercised to ensure proper nesting of markup and directional embedding or override; otherwise, rendering results are undefined.
