HTML Unleashed. Internationalizing HTML: Language Identification | WebReference



HTML Unleashed: Internationalizing HTML

Language Identification


Character set problems are only one part of HTML internationalization. Almost equally important is identifying the language of a document. Many aspects of document presentation depend not only on the character set but also on the language of the text.

For example, as I've mentioned before, the same ideographs are used in many Far Eastern languages, but each language renders them with slightly different glyphs and pronounces them quite differently. Also, languages that share a character set may differ greatly in hyphenation, spacing, use of punctuation, and so on.

To this end, HTML 4.0 introduces the new LANG attribute, which can be used with most HTML elements to describe the language of the element contents.  A "language" in this context is defined as "spoken (or written) by human beings for communication of information to other human beings; computer languages are explicitly excluded." For example:

<P LANG="fr">Ce paragraphe est en français</P>

The LANG attribute may take as its value a two-letter abbreviated code (or tag) for the language. These codes are defined by the ISO 639 standard and should not be confused with country codes (for example, uk as a language code means Ukrainian, not United Kingdom).
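As a minimal sketch of the distinction between language codes and country codes, the document below marks several elements with two-letter ISO 639 codes. The page content, the choice of elements, and the UTF-8 charset declaration are illustrative assumptions, not part of the original text.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML LANG="en">
<HEAD>
<!-- Assumed encoding, needed here only because the sample contains Cyrillic text -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<TITLE>Language codes versus country codes</TITLE>
</HEAD>
<BODY>
<!-- "uk" is a language code, so this marks Ukrainian text -->
<P LANG="uk">Цей абзац написано українською мовою.</P>
<!-- English text takes the language code "en", wherever it was written -->
<P LANG="en">This paragraph is written in English.</P>
<!-- LANG can also be set on an inline element such as Q -->
<P>The French say <Q LANG="fr">bonjour</Q> in the morning.</P>
</BODY>
</HTML>
```

Note that a paragraph written in the United Kingdom is still labeled en, because the attribute describes the language of the text rather than the place it comes from.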

Also, extended identifiers may be used to designate different dialects or writing systems of a language, identify the country in which it is used, and so forth.  These extended identifiers are based on two-letter codes with the addition of subtags separated by a hyphen (-), for example:

en-US: English language of the USA (two-letter subtags are always interpreted as country codes)

no-nynorsk: Nynorsk variant of Norwegian

az-cyrillic: Azerbaijani language written in Cyrillic script

A registry of such extended language identifiers is maintained by IANA.  All LANG values are case insensitive; their complete syntax is defined by RFC 1766.  Another useful resource is the document where most known languages are listed along with the character sets they use.
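The fragment below sketches how extended identifiers and their case insensitivity might look in markup. The tags en-US, en-GB, and no-nynorsk follow the RFC 1766 pattern of a primary code plus hyphen-separated subtags; the sample sentences are illustrative assumptions.

```html
<!-- Primary language code plus a two-letter country subtag -->
<P LANG="en-US">The color of the theater curtain was gray.</P>

<!-- LANG values are case insensitive, so this names the same language as above -->
<P LANG="EN-us">The color of the theater curtain was gray.</P>

<!-- Same language, different country subtag -->
<P LANG="en-GB">The colour of the theatre curtain was grey.</P>

<!-- A longer subtag names a variant of the language rather than a country -->
<P LANG="no-nynorsk">Eg heiter Kari.</P>
```

Writing the primary code in lowercase and the country subtag in uppercase is only a common convention; since the values are case insensitive, any capitalization identifies the same language.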


Created: Jun. 15, 1997
Revised: Jun. 16, 1997