HTML Unleashed. Internationalizing HTML: HTML Character Set | WebReference



HTML Unleashed: Internationalizing HTML

HTML Character Set


Now that you are acquainted with various character encoding standards and the MIME-supplied method to indicate the standard in use, it's time to get to HTML and see how it is tweaked in version 4.0 to handle multilanguage data.


Document Character Set Versus External Character Encoding


First of all, an important distinction should be made.  Chapter 3 examines the SGML declaration for HTML and, in particular, its CHARSET section.  This section defines the single document character set to be used by all conforming HTML documents.

On one hand, this makes the choice of the document character set fairly obvious: It should be itself international, which means Unicode or, better yet, ISO 10646.  Here's how the SGML declaration for HTML 4.0 defines the document character set (see Chapter 3 for syntax explanations):

   BASESET  "ISO 646:1983//CHARSET
             International Reference Version
             (IRV)//ESC 2/5 4/0"
   DESCSET  0   9   UNUSED
            9   2   9
            11  2   UNUSED
            13  1   13
            14  18  UNUSED
            32  95  32
            127 1   UNUSED
   BASESET  "ISO Registration Number 176//CHARSET
                ISO/IEC 10646-1:1993 UCS-2 with
                implementation level 3//ESC 2/5 2/15 4/5"
   DESCSET  128 32    UNUSED
            160 65375 160

Here, ISO 10646 is employed in one of its encoding forms, namely UCS-2, a 2-octet form identical to Unicode.  RFC 2070 takes a more thoroughgoing approach and bases the document character set on the canonical 4-octet form of ISO 10646 only, without a reference to ISO 646 (which is a subset of ISO 10646 anyway) and with the upper limit of the code space raised to as much as 2147483646.
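The DESCSET rows in the declaration above can be read as triples: the first code described, the number of codes described, and either the matching start position in the base set or UNUSED. The following Python sketch only illustrates that reading for the second BASESET; the names UCS2_DESCSET and describe are invented for this example and are not part of any SGML tool.

```python
# Each row is (first code, number of codes, base-set start or None for UNUSED).
# The two rows are copied from the ISO 10646 part of the declaration above.
UCS2_DESCSET = [
    (128, 32, None),     # described as UNUSED
    (160, 65375, 160),   # mapped one-to-one onto ISO 10646
]

def describe(descset):
    """Return (first, last, mapped) for every row of a DESCSET table."""
    return [(start, start + count - 1, base is not None)
            for start, count, base in descset]

for first, last, mapped in describe(UCS2_DESCSET):
    print(f"{first}..{last}: {'mapped' if mapped else 'UNUSED'}")
```

Read this way, the last row covers code positions 160 through 65534, which is where the 2-octet limit of this declaration comes from.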

On the other hand, however, it is unrealistic to expect and fairly unreasonable to require that all HTML authors and browser manufacturers switch to Unicode in the next couple of months (or years, for that matter).  So how can we get the benefits of Unicode without making everybody change over to it?

RFC 2070 explains that this quandary is resolved by differentiating the document character set from the external character encoding of the document.  The external encoding is applied to the document when it's stored on a server and transferred through the network; this encoding can be arbitrary, provided that it is sufficient to encode the character repertoire of the document and that both server and user agent software are capable of handling it properly.

Upon receiving the document, the user agent software should convert it from the external encoding to the document character set, so that further SGML processing and markup parsing is performed in this character set only.  Before displaying the parsed document, the user agent may recode it once again, for example, to conform to the encoding supported by the operating system (in order to call its display services) or to match the encoding of the fonts that will be used for output.

Converting the document from external encoding to the SGML-specified document character set is done for two obvious purposes.  First, it is necessary to ensure that all characters that have special meaning in HTML, such as letters in element names and < and > characters, are correctly recognized (although it is unlikely, external encoding could remap some of these characters to other bit combinations).

And second, remember that authors can invoke characters using character references (see Chapter 3), such as &#169; for the copyright sign.  For these references to be unambiguous and not require changes when the document is recoded from one character set to another, it is declared that these explicit references always refer to the document character set, that is, Unicode.

This means that, for example, to access the CYRILLIC CAPITAL LETTER EF via a character reference, you should use its Unicode code, which yields &#1060;, regardless of what character encoding you work in when creating your document.  It doesn't matter whether you use KOI8-R 8-bit Cyrillic encoding in which this letter is coded 230 (decimal) or ISO 8859-1 that has no Cyrillic alphabet at all.  A compliant HTML parser should always resolve character references in terms of the document character set (Unicode) simply because, at the time of HTML-specific processing, the document should be already converted into the document character set.
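As a rough illustration of this two-step processing, the Python sketch below decodes the KOI8-R octet 230 into the document character set and separately resolves the reference &#1060;. It relies on Python's standard koi8_r codec and html.unescape; a real browser is of course organized differently, so treat this as a model of the idea only.

```python
import html

koi8_bytes = bytes([230])               # the letter as stored in the KOI8-R external encoding
decoded = koi8_bytes.decode("koi8_r")   # step 1: convert to the document character set
resolved = html.unescape("&#1060;")     # step 2: references are resolved against Unicode

# Both routes arrive at the same Unicode character, code position 1060.
print(ord(decoded), decoded == resolved)
```

The reference stays &#1060; even if the file is later recoded to another Cyrillic encoding; only the raw octet changes.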

Here are some advantages of using Unicode as the document character set and separating it from external encoding of a document:

  • This solution is backwards compatible with previous versions of the HTML standard.  Indeed, character references in, say, HTML 3.2 were supposed to refer to ISO 8859-1, which is a proper subset of Unicode.  (Whether a code is 8 or 16 bits makes no difference in this case because character references use decimal values of codes where padding zeroes can be dropped.)

  • The solution is also fairly flexible.  International HTML authors can continue using the character sets that are widely supported by software and that minimize overhead for their languages.  At the same time, they acquire the capability to directly access the entire character space of Unicode via character references.

  • Finally, implementing the technique should not be too bothersome for browser manufacturers.  RFC 2070 does not even require user agents (browsers) to be able to display every Unicode character, but offers instead a number of workarounds for the cases when a browser cannot generate a glyph for a particular Unicode code (for example, displaying the hexadecimal code or some special icon in place of the character).
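The hexadecimal-code workaround mentioned in the last item could look roughly like the following Python sketch. The render function, the has_glyph callback, and the [U+XXXX] notation are all assumptions made for this example; RFC 2070 does not prescribe any particular form.

```python
def render(text, has_glyph):
    """Replace characters the (hypothetical) font cannot draw with their hex code."""
    return "".join(ch if has_glyph(ch) else f"[U+{ord(ch):04X}]" for ch in text)

# Assumption for the example: a font that covers only the Latin-1 range.
def latin1_only(ch):
    return ord(ch) < 256

# The copyright sign is drawn as is; the Cyrillic letter falls back to its code.
print(render("\u00a9 \u0424", latin1_only))
```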

Specifying External Character Encoding


For an HTML browser to correctly translate the received document from the external encoding into the document character set, it must know the external encoding beforehand.  As of now, MIME is the only standard mechanism capable of communicating such information.  As described earlier in this chapter, the charset parameter is included in the Content-Type field that must be a part of any HTTP header, that is, must precede any document sent via HTTP.  This field should also be included in the header of an e-mail message containing an HTML document.  Currently, there is no way to indicate the character encoding for an HTML document retrieved via FTP or from a local or distributed file system.

Common browsers such as Netscape or Internet Explorer recognize the charset parameter and try to switch to the requested character set before displaying the document (in Netscape, for example, you can open the Options|Document Encoding submenu to see the list of supported character sets).  If no charset parameter is specified, ISO 8859-1 is assumed, and if it's not what the author planned for the document, the user will have to guess which encoding to switch to manually in order to read the document.  (It is not unreasonable to claim that the very possibility of manually switching character sets in common browsers is to blame for the abundance of web servers that never care to declare the character encoding of the documents they deliver.)
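A minimal sketch of how a client might extract the charset parameter and fall back to ISO 8859-1 is shown below. It borrows the MIME parameter parser from Python's standard email package; the function name external_encoding is hypothetical.

```python
from email.message import Message

def external_encoding(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the name and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset(default)

print(external_encoding("text/html; charset=KOI8-R"))
print(external_encoding("text/html"))
```

The second call shows the situation described above: with no charset parameter the client silently assumes ISO 8859-1, whatever the author intended.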

However, there's something more to character set negotiation.  HTTP allows a client program to list a number of character encodings it can handle, in the order of preference, right in an HTTP request using the Accept-Charset field.  This enables the server to select the appropriate version of the document among those available or translate it to a requested character set on the fly.  The standard declares that if no Accept-Charset value is given in the request, the user agent thereby guarantees that it can handle any character set.  Unfortunately, the only browser (at this time) that allows a user to specify the Accept-Charset value to be inserted in HTTP requests is Lynx.
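On the server side, the selection step might be sketched as follows. This is a simplified model, not a full HTTP implementation: it assumes preferences are expressed with optional q= weights (as in HTTP/1.1), and the function names and the list of available variants are invented for the example.

```python
def parse_accept_charset(header):
    """Return the charset names from an Accept-Charset value, best first."""
    prefs = []
    for item in header.split(","):
        name, _, params = item.strip().partition(";")
        params = params.strip()
        q = float(params[2:]) if params.startswith("q=") else 1.0
        if name:
            prefs.append((q, name.strip().lower()))
    # sorted() is stable, so equal weights keep the order they were listed in.
    return [name for q, name in sorted(prefs, key=lambda p: -p[0]) if q > 0]

def choose_variant(header, available):
    """Pick the stored variant (by charset) that the client prefers."""
    if not header:
        return available[0]          # no header: any charset is acceptable
    for name in parse_accept_charset(header):
        if name == "*":
            return available[0]
        if name in available:
            return name
    return None                      # nothing acceptable is stored

print(choose_variant("koi8-r;q=0.8, windows-1251", ["iso-8859-5", "koi8-r"]))
```

A server that can transcode on the fly would, instead of returning None, convert one of the stored variants to the client's first choice.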

One more method to indicate the external character encoding of a document is by emulating the Content-Type header field in a META element.  For this, you should place the following tag within the HEAD block of your HTML document:

<META HTTP-EQUIV="Content-Type"
            CONTENT="text/html; charset=KOI8-R">
This example declares that the document is in the KOI8-R Cyrillic encoding.  The META method is a handy choice for those who are unable or unwilling to change the setup of the server that the document is stored on, but it has an obvious downside: Such a document, if automatically converted from one encoding to another, requires manually changing the <META> tag attributes.  The META encoding indication is supported by most browsers, but beware of a pitfall: Contrary to the standard stating that the charset value in the HTTP header, if present, should override its META emulation, some browsers give preference to the META-supplied value.
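To read such a declaration, a client has to scan the beginning of the document before it knows the encoding, which works only because the tag itself is written in ASCII. The Python sketch below shows one way to do this with the standard html.parser module; the class and function names are hypothetical, and real browsers use their own prescanning logic.

```python
from email.message import Message
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first META Content-Type emulation."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)          # tag and attribute names arrive lowercased
        if (tag == "meta" and self.charset is None
                and (attrs.get("http-equiv") or "").lower() == "content-type"):
            msg = Message()
            msg["Content-Type"] = attrs.get("content") or ""
            self.charset = msg.get_content_charset()

def meta_charset(raw_bytes):
    """Scan undecoded bytes; non-ASCII octets are irrelevant to finding the tag."""
    finder = MetaCharsetFinder()
    finder.feed(raw_bytes.decode("ascii", errors="replace"))
    return finder.charset

page = (b'<HTML><HEAD><META HTTP-EQUIV="Content-Type"\n'
        b'      CONTENT="text/html; charset=KOI8-R"></HEAD>'
        b'<BODY>\xe6</BODY></HTML>')
print(meta_charset(page))
```

A client following the standard would consult this value only when the HTTP header carries no charset parameter.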

Forms Internationalization


When browsing on the Web, you not only download textual information, but sometimes upload it as well using the forms mechanism of HTML.  Naturally, this mechanism needs adjustments to allow character set negotiation of the data submitted from the user agent software to the server.  This section covers the new features introduced to meet this requirement.

In HTML 4.0, the FORM tag is given an additional ACCEPT-CHARSET attribute that is similar to the Accept-Charset HTTP field mentioned in the preceding section.  The main difference is that the ACCEPT-CHARSET attribute in HTML works the other way around, specifying what character encodings the server is able to receive from the user.  The value of the ACCEPT-CHARSET attribute is a list of MIME identifiers for character encodings the server can handle, in order of preference; usually this list contains at least the external character encoding of the document itself.

A browser could make use of the value of the ACCEPT-CHARSET attribute in several ways:

  • A browser must configure the text input areas so that the text being typed in would display using appropriate glyphs.  This is a minimal level of support (it is implemented, for example, in Netscape Navigator 3.0, although this browser uses the main document encoding for this purpose instead of the ACCEPT-CHARSET attribute value), as it leaves the user with the main problem of how to input text properly.  If the operating system does not support the encoding, it may be necessary to use a specialized keyboard driver or copy and paste previously converted text.  In certain cases, an HTML author could provide a clue right in the document as to which encoding is accepted in a particular input field.

  • Better yet, a browser must take into account the character encoding supported by the operating system and convert the text typed in by the user before sending it out.  The conversion is possible only if the encoding supported by the system and the encoding accepted by the server have identical character repertoires, and it is necessary only if these two encodings are not the same.  This makes the preceding item unnecessary, as the operating system itself takes care of the proper display of characters in text input areas, provided that they use the native encoding of the system.  This level of support is implemented in Microsoft Internet Explorer 4.0 and Netscape Navigator 4.0 (but here again, both these browsers ignore the ACCEPT-CHARSET value and consider the form charset the same as the document charset).

  • RFC 2070 suggests that a browser may restrict the range of characters that can be input in the text area in accordance with the encoding specified.  In my opinion, this is rather useless if not accompanied by one of the other two provisions.
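The conversion described in the second item can be sketched in Python as follows. The sketch assumes the typed text is already available as Unicode and simply tries each encoding from the ACCEPT-CHARSET list in order; the function name encode_for_form is hypothetical, and the treatment of unknown charset names is a choice made for this example.

```python
def encode_for_form(text, accept_charset):
    """Encode text in the first listed charset that can represent all of it."""
    for name in accept_charset.replace(",", " ").split():
        try:
            return name, text.encode(name)
        except (UnicodeEncodeError, LookupError):
            continue      # repertoire too small, or charset unknown to Python
    return None           # no accepted encoding covers the text

print(encode_for_form("\u0424", "iso-8859-1 koi8-r"))
```

Here ISO 8859-1 is skipped because it has no Cyrillic letters, and the text goes out in KOI8-R. Returning None corresponds to the case where the browser would have to warn the user or restrict the input.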

The second part of the forms internationalization problem is how to submit the form data along with the information about its encoding.  For the first of the two submission methods, POST, MIME is helpful once again.  It is possible to add the charset parameter to the "Content-Type: application/x-www-form-urlencoded" header field that precedes any data sent with the POST method.
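As an illustration of the POST case, the Python sketch below URL-encodes a field value in KOI8-R and labels the request body accordingly. The field name and the headers dictionary are invented for the example; urlencode and parse_qs are the standard urllib.parse functions.

```python
from urllib.parse import urlencode, parse_qs

charset = "koi8-r"
fields = {"name": "\u0424"}          # CYRILLIC CAPITAL LETTER EF

# Client side: characters -> KOI8-R octets -> %HH notation.
body = urlencode(fields, encoding=charset)
headers = {
    "Content-Type": f"application/x-www-form-urlencoded; charset={charset}",
}

# Server side: the declared charset tells the receiver how to read the octets.
received = parse_qs(body, encoding=charset)

print(body)
print(received == {"name": ["\u0424"]})
```

Without the charset parameter, the receiver would see only the octet %E6 and would have to guess which character it stands for.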

However, RFC 2070 gives preference to another technique that uses the "multipart/form-data" content type that was proposed in RFC 1867 for form-based file uploads.  (RFC 1867 provisions are also incorporated into HTML 4.0.) With this method, form data is not encapsulated in the form of a URL, and each name/value pair may have its own charset parameter attached.  Currently, this technique is not supported by common browsers.
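To make the per-field labeling concrete, here is a hand-built multipart/form-data body in Python. It is a simplified sketch in the spirit of RFC 1867: the boundary string, field names, and helper function are assumptions for this example, and no escaping of field names is attempted.

```python
def multipart_body(fields, boundary="example-boundary"):
    """Build a body from (name, text, charset) triples, one part per field."""
    parts = []
    for name, text, charset in fields:
        head = (f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n'
                f"Content-Type: text/plain; charset={charset}\r\n\r\n")
        # Headers are ASCII; only the value is in the field's own encoding.
        parts.append(head.encode("ascii") + text.encode(charset) + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("ascii"))
    return b"".join(parts)

body = multipart_body([
    ("city", "\u0424", "koi8-r"),
    ("note", "\u00a9", "iso-8859-1"),
])
print(body.decode("latin-1"))
```

Each value travels as raw octets next to its own charset label, so two fields of the same form can use different encodings.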

With the other form submission method, GET, data is encapsulated right in the URL that the browser submits to the server.  In principle, URLs may contain any bit combinations provided that they are encoded using the %HH notation.  However, quoting RFC 2070, "text submitted from a form is composed of characters, not octets," and there's no easy way to incorporate information about the encoding of text data into a URL (other than by providing an additional input field that the user will need to manually set, which is pretty awkward).

RFC 2070 suggests that even with the GET method, user agent software could send the data in the body of the HTTP request instead of the URL, although currently no applications support this technique.  Another solution with URLs might be using one of the special formats of ISO 10646; in particular, the UTF-8 format preserves all 7-bit ASCII characters as is and encodes any non-ASCII characters using only octets with the most significant bit set, i.e. those outside the ASCII range.  This makes UTF-8 completely backwards compatible with the URL syntax.  Because ISO 10646 is a superset of all other character sets, a string in such a format doesn't require any further charset specifications (provided, of course, that the server is aware that UTF-8 is being used).
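The UTF-8 property described here is easy to check with a short Python sketch. The query string is invented for the example; quote and unquote are the standard urllib.parse functions, which use UTF-8 by default.

```python
from urllib.parse import quote, unquote

text = "EF=\u0424"                      # ASCII part plus one Cyrillic letter

ascii_octets = list("EF=".encode("utf-8"))
cyrillic_octets = list("\u0424".encode("utf-8"))

# ASCII characters keep their 7-bit codes; the Cyrillic letter becomes
# two octets, both with the most significant bit set.
print(ascii_octets, cyrillic_octets)

escaped = quote(text, safe="=")         # non-ASCII octets go into %HH notation
print(escaped, unquote(escaped) == text)
```

Because no octet of a non-ASCII character can be mistaken for an ASCII delimiter such as = or &, the URL syntax itself is never disturbed.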


Real-World Character Set Problems


In fact, differentiating the document character set from external encoding is nothing really new in HTML.  Any numerical character references in a document conforming to HTML 3.2 or an earlier version must refer to the characters from the Latin-1 (ISO 8859-1) set regardless of the external character set of the document.  Unfortunately, this convention is ignored by many contemporary browsers, which leads to undesirable (although, admittedly, not too serious in the case of HTML 3.2 without internationalization extensions) consequences.

For instance, the KOI8-R character encoding as defined in RFC 1489 specifies code 191 (decimal) for the COPYRIGHT SIGN character.  In ISO 8859-1, the same symbol is coded 169.  Ideally, when the mnemonic entity &copy; or the character reference &#169; (which is what &copy; expands to, as defined by the HTML DTD) is used in a KOI8-R document, the browser must resolve it with regard to the ISO 8859-1 character set and display the copyright sign (for example, by accessing code position 191 in a KOI8-R font).

However, as most browsers are incapable of remembering anything about the ISO 8859-1 character set after being switched to KOI8-R or whatever external encoding is used for a document, an HTML author cannot rely any more on the table of Latin-1 mnemonic character entities.  These entities or numeric character references are guaranteed to work only if the document itself is created (and viewed) in ISO 8859-1.
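The difference between the correct and the broken behavior can be modeled in a few lines of Python. The sketch uses the standard html.unescape function and the koi8_r codec; it imitates the logic only and makes no claim about how any particular browser is implemented.

```python
import html

# Correct: resolve the reference against the document character set first...
char = html.unescape("&copy;")
same = char == html.unescape("&#169;")

# ...and only then look up the position of that character in a KOI8-R font.
position_in_koi8 = char.encode("koi8_r")[0]

# Broken: treat 169 as a code position in the external encoding itself.
wrong = bytes([169]).decode("koi8_r")

print(same, position_in_koi8, wrong == char)
```

The correct route ends at position 191 of the KOI8-R font; the broken route displays whatever KOI8-R happens to place at position 169, which is not the copyright sign.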

As a sort of a workaround, creators of several KOI8-R Cyrillic fonts for use on the Web chose to move the copyright sign from the standard-prescribed code 191 to the Latin-1-inspired 169.  As Alan Flavell of CERN has put it, "Breaking your font in order to help a broken browser is a bad idea." It is obvious that, with the internationalized HTML gaining wide recognition, the problem may become more severe, as Unicode character references in conforming documents are much more likely to go out of sync with the external character encoding of a document.

In fact, support for nonstandard document encodings in browsers such as Netscape Navigator 3.0 is reduced to the capability to switch display fonts, in response to either the charset parameter in the HTTP header or a command selected by the user, and little else.  As a result, Netscape Navigator 3.0 cannot display Russian texts in KOI8-R without KOI8-R Cyrillic fonts installed, even if it's working under a Russian version of Windows that provides Cyrillic fonts in the Windows encoding.
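What is missing in such browsers is a recoding step from the external encoding to the encoding of the installed fonts. A minimal Python sketch of that step follows; it assumes the Windows Cyrillic encoding is the one Python calls cp1251, and the function name transcode is hypothetical.

```python
def transcode(data, source, target):
    """Recode octets by going through Unicode as the pivot character set."""
    return data.decode(source).encode(target)

koi8_text = bytes([230])                       # CYRILLIC CAPITAL LETTER EF in KOI8-R
windows_text = transcode(koi8_text, "koi8_r", "cp1251")

print(windows_text[0])                         # position of the same letter in a Windows font
```

With this step in place, the system's own Cyrillic fonts would be sufficient, and no KOI8-R fonts would need to be installed.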

There are still more problems related to document character encoding that many common browsers are unable to cope with, and that HTML authors should therefore beware of:

  • Even when the text of a document is correctly displayed, its title, if it contains encoding-specific characters, may appear broken in the window title bar (apparently because the font used in window title bars is determined by the operating system, which may be completely unaware of the encoding of the document).

  • ALT texts in place of inline images, as well as button labels in forms, may not display correctly if they contain encoding-specific characters (again, the reason is that many browsers use a system-provided font for these purposes).

  • Text-oriented Java applets in Java-enabled browsers may have problems with displaying text in a nonstandard encoding.

Created: Jun. 15, 1997
Revised: Jun. 16, 1997