It is natural to expect input in any language in forms, as they provide one of the few ways of obtaining user input. While this is primarily a UI issue, there are some things that should be specified at the HTML level to guide behavior and promote interoperability.
To ensure full interoperability, the user agent (and the user) needs an indication of the character encoding(s) that the server providing a form will be able to handle upon submission of the filled-in form. This indication is provided by the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements, modeled on the HTTP Accept-Charset header (see [HTTP-1.1]); its value is a space- and/or comma-delimited list of character sets acceptable to the server. A user agent may want to advise the user of the contents of this attribute, or to restrict the user's ability to enter characters outside the repertoires of the listed character sets.
NOTE — The list of character sets is to be interpreted as an EXCLUSIVE-OR list; the server announces that it is ready to accept any ONE of these character encoding schemes for each part of a multipart entity. The client may perform character encoding translation to satisfy the server if necessary.
NOTE — The default value for the ACCEPT-CHARSET attribute of an INPUT or TEXTAREA element is the reserved value UNKNOWN. A user agent may interpret that value as the character encoding scheme that was used to transmit the document containing that element.
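As an illustration of the ACCEPT-CHARSET rules above, the following is a minimal sketch of how a user agent might split the attribute value and pick ONE encoding able to represent the text entered. The function names `parse_accept_charset` and `choose_charset`, and the policy of taking the first workable entry, are assumptions made for this example; the specification does not prescribe a selection order.

```python
import codecs
import re


def parse_accept_charset(value):
    """Split a space- and/or comma-delimited list into charset names."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


def choose_charset(value, text, document_charset):
    """Return the first listed charset that can encode `text`, or None.

    The reserved value UNKNOWN is interpreted here as the encoding of the
    document containing the form (one permitted interpretation).
    """
    for name in parse_accept_charset(value):
        if name.upper() == "UNKNOWN":
            name = document_charset
        try:
            codecs.lookup(name)   # is the charset known to this client?
            text.encode(name)     # can it represent every character entered?
        except (LookupError, UnicodeEncodeError):
            continue
        return name
    return None
```

For example, `choose_charset("ISO-8859-1, ISO-8859-5", "Привет", "ISO-8859-1")` skips ISO-8859-1, which cannot represent Cyrillic, and selects ISO-8859-5.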
The HTML 2.0 form submission mechanism, based on the application/x-www-form-urlencoded media type, is ill-equipped with regard to internationalization. In fact, since URLs are restricted to ASCII characters, the mechanism is awkward even for ISO-8859-1 text. Section 2.2 of [RFC1738] specifies that octets may be encoded using the %HH notation, but text submitted from a form is composed of characters, not octets. Lacking a specification of a character encoding scheme, the %HH notation has no well-defined meaning.
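The ambiguity of the %HH notation can be seen in a short sketch. It uses Python's standard `urllib.parse` module; the choice of the character "é" and of ISO-8859-1 versus UTF-8 as the two candidate encodings is an assumption made for illustration.

```python
from urllib.parse import quote, unquote

text = "é"  # one character, U+00E9

# The same character yields different %HH sequences depending on the
# character encoding scheme applied before URL encoding.
as_latin1 = quote(text, encoding="iso-8859-1")  # "%E9"
as_utf8 = quote(text, encoding="utf-8")         # "%C3%A9"

# A server that guesses the wrong encoding recovers different text.
misread = unquote(as_utf8, encoding="iso-8859-1")  # "Ã©"
```

Without an accompanying label, a receiver of `%C3%A9` cannot tell whether one character or two were submitted.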
The best solution is to use the multipart/form-data media type described in [RFC1867] with the POST method of form submission. This mechanism encapsulates the value part of each name-value pair in a body-part of a multipart MIME body that is sent as the HTTP entity; each body part can be labeled with an appropriate Content-Type, including if necessary a charset parameter that specifies the character encoding scheme. The changes to the DTD necessary to support this method of form submission have been incorporated in the DTD included in this specification.
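A multipart/form-data entity with per-part charset labels might be assembled roughly as follows. This is a simplified sketch: the helper `build_multipart`, the fixed boundary string, and the use of text/plain for every part are assumptions for illustration, and a real client must also ensure that the boundary does not occur in the data and that field names are safely escaped.

```python
def build_multipart(fields, boundary):
    """Build a multipart/form-data body.

    `fields` is a list of (name, text, charset) tuples; each value is
    encoded in its own charset and labeled accordingly.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, text, charset in fields:
        lines.append(delimiter)
        lines.append(
            f'Content-Disposition: form-data; name="{name}"'.encode("ascii")
        )
        lines.append(
            f"Content-Type: text/plain; charset={charset}".encode("ascii")
        )
        lines.append(b"")                    # blank line ends the part headers
        lines.append(text.encode(charset))   # value in the labeled encoding
    lines.append(delimiter + b"--")          # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)


boundary = "example-boundary"  # hypothetical; must not appear in any value
body = build_multipart(
    [("greeting", "Привет", "ISO-8859-5"), ("city", "Montréal", "ISO-8859-1")],
    boundary,
)
content_type = f"multipart/form-data; boundary={boundary}"
```

Because each part carries its own Content-Type, the two fields can use different encodings within a single submission.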
A less satisfactory solution is to add a MIME charset parameter to the application/x-www-form-urlencoded media type specifier sent along with a POST method form submission, with the understanding that the URL encoding of [RFC1738] is applied on top of the specified character encoding, as a kind of implicit Content-Transfer-Encoding.
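The layering described here, character encoding first and URL encoding on top, can be sketched as below. The helpers `encode_form` and `decode_form` are hypothetical names, and the parameter parsing in `decode_form` is deliberately naive; it assumes the charset parameter is present and is the last parameter of the header value.

```python
from urllib.parse import parse_qs, urlencode


def encode_form(fields, charset):
    """Encode characters in `charset`, then apply %HH encoding on top."""
    body = urlencode(fields, encoding=charset).encode("ascii")
    content_type = f"application/x-www-form-urlencoded; charset={charset}"
    return content_type, body


def decode_form(content_type, body):
    """Undo the %HH layer, then decode octets using the labeled charset."""
    charset = content_type.split("charset=", 1)[1].strip()
    return parse_qs(body.decode("ascii"), encoding=charset)


content_type, body = encode_form({"name": "é"}, "iso-8859-1")
fields = decode_form(content_type, body)
```

Here the charset parameter is what gives `%E9` in the body a well-defined meaning on the receiving side.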
One problem with both solutions above is that current browsers generally do not allow bookmarks to specify the POST method; this should be improved. Alternatively, the GET method could be used with the form data transmitted in the body instead of in the URL. Nothing in the protocol seems to prevent this, but no implementations appear to exist at present.
How the user agent determines the encoding of the text entered by the user is outside the scope of this specification.
NOTE — Designers of forms and their handling scripts should be aware of an important caveat: when the default value of a field (the VALUE attribute) is returned upon form submission (i.e. the user did not modify this value), it cannot be guaranteed to be transmitted as a sequence of octets identical to that in the source document — only as a possibly different but valid encoding of the same sequence of text elements. This may be true even if the encoding of the document containing the form and that used for submission are the same.
Differences can occur when a sequence of characters can be represented by various sequences of octets, and also when a composite sequence (a base character plus one or more combining diacritics) can be represented by either a different but equivalent composite sequence or by a fully precomposed character. For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other equivalent composite sequences.
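A form handling script can guard against such differences by comparing normalized text rather than raw octets. The sketch below uses Python's standard `unicodedata` module with Unicode normalization form NFC; normalization forms are not part of this specification and are used here only as one possible way to test equivalence. The UTF-16 big-endian encoding stands in for UCS-2 in this example.

```python
import unicodedata

partly_composed = "\u00ea\u0323"         # e with circumflex + combining dot below
precomposed = "\u1ec7"                   # e with circumflex and dot below
fully_decomposed = "\u0065\u0302\u0323"  # e + combining circumflex + combining dot below
variants = [partly_composed, precomposed, fully_decomposed]


def same_text(a, b):
    """Compare two strings as text elements, ignoring composition differences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The three variants are distinct octet sequences ...
octets = {v.encode("utf-16-be") for v in variants}

# ... but they all represent the same text element.
all_equivalent = all(same_text(precomposed, v) for v in variants)
```

A script that compares a submitted default value with the one in the source document would use a check like `same_text` instead of a byte-for-byte comparison.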