HTML Unleashed. Internationalizing HTML: Character Encoding Standards | WebReference


HTML Unleashed: Internationalizing HTML

Character Encoding Standards


As it happened, the computer industry first flourished in a country whose language uses one of the most compact alphabets in the world.  However, not long after the first computers learned to spell English, a need arose to encode characters from other languages.  In fact, even the minimum set of Latin letters and basic symbols was for some time the subject of controversy between two competing standards, ASCII and the now almost extinct EBCDIC.  No wonder that for other languages' alphabets a similar muddle has lasted much longer; in fact, it is still far from over.

As explained in Chapter 3, "SGML and the HTML DTD," a character encoding (often called character set or, more precisely, coded character set) is defined by three things: first, the numerical range of codes; second, the repertoire of characters; and third, a mapping between these two sets.  The term "character set" is therefore a bit misleading, because it actually implies two sets and a relation between them.  Probably the most precise definition of a character encoding in mathematical terms is given by Dan Connolly in his paper "Character Set Considered Harmful": "A function whose domain is a subset of integers, and whose range is a set of characters."
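Connolly's definition can be made concrete with a minimal Python sketch.  The three-character table `TOY_CHARSET` and the helper `decode` are hypothetical names invented for illustration; only the ASCII table, built from Python's own codec, reflects a real standard.

```python
# A coded character set modeled as a mapping from integer codes to characters.
# TOY_CHARSET is a made-up example, not a real standard.
TOY_CHARSET = {0: "A", 1: "B", 2: "C"}

def decode(codes, charset):
    """Apply the code-to-character function to each integer code."""
    return "".join(charset[code] for code in codes)

# The same idea for a real encoding: the 128-entry ASCII table.
ASCII = {code: bytes([code]).decode("ascii") for code in range(128)}

print(decode([2, 0, 1], TOY_CHARSET))   # CAB
print(decode([72, 84, 77, 76], ASCII))  # HTML
```

The domain of the function is the set of dictionary keys (the codes), and its range is the set of dictionary values (the repertoire).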

The range of codes is limited by the length of the sequence of bits (called a bit combination) used to encode one character.  For instance, a combination of 8 bits is sufficient to encode a total of 256 characters (although not all of these code positions may actually be used).  The smaller the bit combination, the more compact the encoding (that is, the less storage space is required for a piece of text), but the fewer characters you can encode in total.
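The trade-off between code space and storage can be checked with a little arithmetic.  The following is only a sketch; the function names `code_positions` and `storage_octets` are chosen here for illustration.

```python
def code_positions(bits):
    """Number of distinct codes available with a bit combination of `bits` bits."""
    return 2 ** bits

def storage_octets(char_count, bits):
    """Octets needed to store `char_count` characters at a fixed width."""
    return char_count * bits // 8

# Compare code space and storage cost for a 1000-character text.
for bits in (8, 16, 32):
    print(bits, code_positions(bits), storage_octets(1000, bits))
```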

It is quite logical to codify characters using bit combinations of the size most convenient for computers.  Because modern computer architecture is based on bytes (also called octets) of 8 bits, all contemporary encoding standards use bit combinations of 8, 16, or 32 bits in length.  The next sections survey the most important of these standards to see the roles they play in today's Internet.




7-Bit ASCII


The so-called 7-bit ASCII, or US ASCII, encoding is equivalent to the international standard named ISO 646 established by the International Organization for Standardization (ISO).  This encoding actually uses octets of 8 bits per character, but it leaves the first (most significant) bit in each octet unused; that bit must always be zero.  The 7 useful bits of ISO 646 are capable of encoding a total of 128 characters.
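The "unused first bit" rule is easy to test in code.  This is a small sketch using Python's built-in `ascii` codec; the helper name `is_seven_bit` is an assumption made for the example.

```python
def is_seven_bit(octets):
    """True if every octet has its most significant bit set to zero."""
    return all(octet & 0x80 == 0 for octet in octets)

ascii_octets = "HTML".encode("ascii")
print(is_seven_bit(ascii_octets))         # True
print(is_seven_bit(bytes([0x41, 0xE9])))  # False: 0xE9 uses the 8th bit

try:
    # LATIN SMALL LETTER C WITH CEDILLA lies outside the 128-character repertoire.
    "\u00e7".encode("ascii")
except UnicodeEncodeError as error:
    print("not encodable:", error.reason)
```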

This is the most ubiquitous encoding standard used on the overwhelming majority of computers worldwide (either by itself or as a part of other encodings, as you'll see shortly).  ISO 646 may be called international in the sense that there are precious few computers in the world that use other encodings for the same basic repertoire of characters.  It is also used exclusively for keywords and syntax in all programming and markup languages (including SGML and HTML), as well as for all sorts of data that is human-editable but of essentially computer nature, such as configuration files or scripts.

However, with regard to the wealth of natural languages spoken around the world, ISO 646 is very restrictive.  In fact, only a handful of languages, such as English, Latin, and Swahili, can be written in plain 7-bit ASCII with no additional characters.  Most languages whose alphabets (also called scripts or writing systems) are based on the Latin alphabet use various accented letters and ligatures.

The first 32 codes of ISO 646 are reserved for control characters, which means that they invoke some functions or features in the device that reads the text rather than produce a visible shape (often called glyph) of a character for human readers.  As a rule, character set standards are reluctant to exactly define the functions of control characters, as these functions may vary considerably depending on the nature of text processing software.

For example, of the 32 control characters of ISO 646, only a few (carriage return, line feed, tabulation) have more or less established meanings.  For use in texts, most of the others are simply useless.  The code space thus wasted is a holdover from the old days when these control characters played the role of today's document formats and communication protocols.
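The control range can be inspected directly.  The sketch below relies on Python's `unicodedata` module, which labels control characters with the category "Cc"; the names `CONTROL_CODES` and `FAMILIAR` are illustrative only.

```python
import unicodedata

CONTROL_CODES = range(32)  # the first 32 code positions

# Python's character database classifies every one of them as "Cc" (control).
all_controls = all(
    unicodedata.category(chr(code)) == "Cc" for code in CONTROL_CODES
)

# The few control characters with well-established meanings in text.
FAMILIAR = {"tabulation": "\t", "line feed": "\n", "carriage return": "\r"}
familiar_codes = {name: ord(char) for name, char in FAMILIAR.items()}

print(all_controls)
print(familiar_codes)
```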


8-Bit Encodings


The first natural step to accommodate languages that are more letter-hungry than English is to make use of the 8th bit in every byte.  This provides an additional 128 codes, which are sufficient to encode an alphabet of several dozen letters (for example, Cyrillic or Greek) or a set of additional Latin letters with diacritical marks and ligatures used in many European languages (such as ç in French or ß in German).

Unfortunately, there exist many more 8-bit encodings in the world than are really necessary.  Nearly every computer platform or operating system making its way onto a national market without a strong computer industry of its own brought along a new encoding standard.  For example, as many as three encodings for the Cyrillic alphabet are now widely used in Russia, one being left over from the days of MS-DOS, the second native to Microsoft Windows, and the third being popular in the UNIX community and on the Internet.  A similar situation can be observed in many other national user communities.
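To see what this means in practice, here is a hedged sketch that encodes one Cyrillic letter three ways.  The text does not name the three encodings; the Python codec names `cp866`, `cp1251`, and `koi8_r` are assumed here to be the MS-DOS, Windows, and UNIX/Internet encodings it refers to.

```python
# Assumed codec for each platform mentioned in the text (not named there).
CYRILLIC_CODECS = {
    "MS-DOS": "cp866",
    "Windows": "cp1251",
    "UNIX": "koi8_r",
}

letter = "\u042f"  # CYRILLIC CAPITAL LETTER YA

# The same letter gets a different 8-bit code in every encoding.
codes = {
    platform: letter.encode(codec)[0]
    for platform, codec in CYRILLIC_CODECS.items()
}
for platform, code in codes.items():
    print(f"{platform}: 0x{code:02X}")

# Reading octets with the wrong table silently yields a different letter.
garbled = letter.encode("koi8_r").decode("cp1251")
print(garbled != letter)
```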

ISO, being an authoritative international institution, has done its best to normalize the mess of 8-bit encodings.  The ISO 8859 series of standards covers almost all extensions of the Latin alphabet as well as the Cyrillic (ISO 8859-5), Arabic (ISO 8859-6), Greek (ISO 8859-7), and Hebrew (ISO 8859-8) alphabets.  All of these encodings are backwards compatible with ISO 646; that is, the first 128 characters in each ISO 8859 code table are identical to 7-bit ASCII, while the national characters are always located in the upper 128 code positions.
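The shared lower half and the diverging upper half can be demonstrated with Python's built-in ISO 8859 codecs.  This is a small sketch; the dictionary name `PARTS` and the choice of octet 0xE4 are arbitrary.

```python
PARTS = {
    "ISO 8859-1": "iso8859_1",  # Latin-1
    "ISO 8859-5": "iso8859_5",  # Cyrillic
    "ISO 8859-7": "iso8859_7",  # Greek
}

# Codes 0-127 decode to the same characters as ASCII in every part.
lower_half = bytes(range(128))
same_lower = all(
    lower_half.decode(codec) == lower_half.decode("ascii")
    for codec in PARTS.values()
)

# One octet from the upper half means a different letter in each part.
upper = {name: bytes([0xE4]).decode(codec) for name, codec in PARTS.items()}

print(same_lower)
print(upper)
```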

Again, the first 32 code positions (128 to 159 decimal, inclusive) of the upper half in ISO 8859 are reserved for control characters and should not be used in texts.  This time, however, many software manufacturers chose to disregard the taboo; for example, the majority of TrueType fonts for Windows conform to ISO 8859-1 in code positions from 160 upwards, but use the range 128-159 for various additional characters (notably the em dash and the trademark sign).  This leads to endless confusion about whether one may access these 32 characters in HTML (the DTD, following ISO 8859, declares this character range UNUSED).  HTML internationalization extensions resolve this controversy by making it possible to address these characters via their Unicode codes.
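The conflict over the range 128-159 can be illustrated as follows.  The sketch assumes that Python's `cp1252` codec represents the Windows character assignment described above; the helper `numeric_reference` is a hypothetical name for building an HTML numeric character reference from a Unicode code.

```python
import unicodedata

def numeric_reference(char):
    """HTML numeric character reference for a character's Unicode code."""
    return f"&#{ord(char)};"

for octet in (0x97, 0x99):
    windows_char = bytes([octet]).decode("cp1252")  # a printable character
    iso_char = bytes([octet]).decode("iso8859_1")   # a control character
    print(
        hex(octet),
        unicodedata.category(iso_char),    # "Cc" means control
        unicodedata.name(windows_char),
        numeric_reference(windows_char),
    )
```

Addressing the em dash and the trademark sign by their Unicode codes sidesteps the question of what octets 0x97 and 0x99 mean in the document's own encoding.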

The authority of ISO was not, however, sufficient to position all of the 8859 series as a strong alternative to the ad hoc national encodings supported by popular operating systems and platforms.  For example, ISO 8859-5 is hardly ever used to encode Russian texts except on a small number of computers.

On the other hand, the first standard in the 8859 series, ISO 8859-1 (often called ISO Latin-1), which contains the most widespread Latin alphabet extensions serving many European languages, has been widely recognized as the 8-bit ASCII extension.  Whenever a need arises for an 8-bit encoding standard that is as international as possible, you're likely to see ISO 8859-1 playing the role.  For instance, ISO 8859-1 served as a basis for the document character set in HTML versions up to 3.2 (in 4.0, this role was taken over by Unicode, see below).


16-Bit Encodings


Not all languages in the world use small alphabets.  Some writing systems (for example, Japanese and Chinese) use ideographs, or hieroglyphs, instead of letters, each corresponding not to a sound of speech but to an entire concept or word.  As there are many more words and conceivable ideas than there are sounds in a language, such writing systems usually contain many thousands of ideographs.  An encoding for such a system needs at least 16 bits (2 octets) per character, which accommodates a total of 2^16 = 65536 characters.

Ideally, such a 16-bit encoding should be backwards compatible with the existing 8-bit (and especially 7-bit ASCII) encodings.  This means that an ASCII-only device reading a stream of data in this encoding should be able to correctly interpret at least the ASCII characters if they are present.  This is implemented using code switching, or escaping, techniques: special sequences of control characters are used to switch back and forth between ASCII mode, with 1 octet per character, and 2-octet modes (also called code pages).  Encodings based on this principle are now widely used for Far East languages.
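One encoding of this kind is ISO-2022-JP, used here purely as an assumed example because the text does not name a specific one.  The sketch below uses Python's built-in `iso2022_jp` codec to show the escape sequences that switch between the 1-octet and 2-octet modes.

```python
ESC = b"\x1b"  # the ESCAPE control character that starts each switching sequence

# ASCII text, three Japanese ideographs, then ASCII again.
text = "HTML \u65e5\u672c\u8a9e ok"
encoded = text.encode("iso2022_jp")

print(encoded)  # escape sequences appear around the 2-octet section

escape_count = encoded.count(ESC)
seven_bit_only = all(octet < 0x80 for octet in encoded)

# Pure ASCII passes through untouched, so an ASCII-only reader still copes.
plain = "abc".encode("iso2022_jp")
print(escape_count, seven_bit_only, plain)
```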

Code switching works all right, but one interesting problem is that the technique makes it ambiguous what to consider the coded representation of a character: is it just its 2-octet code, or the code preceded by the switching sequence?  It is obvious that the "extended" national symbols and ASCII characters are not treated equally in such systems, which may be practically justifiable but is likely to pose problems in the future.

In the late 1980s, the need for a truly international 16-bit coding standard became apparent.  The Unicode Consortium, formed in 1991, undertook to create such a standard, called Unicode.  In Unicode, every character from the world's major writing systems is assigned a unique 2-octet code.  Following tradition, the first 128 codes of Unicode are identical to 7-bit ASCII, and the first 256 codes to ISO 8859-1.  However, strictly speaking, this standard is not backwards compatible with 8-bit encodings; for instance, the Unicode code for the Latin letter A is 0041 (hex), while the ASCII code for the same letter is simply 41.
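The relationship between ASCII, ISO 8859-1, and 2-octet Unicode can be checked in a few lines.  In this sketch, Python's `utf-16-be` codec stands in for the plain 2-octet form; that is an assumption which holds for the characters shown here.

```python
ascii_a = "A".encode("ascii")        # one octet: 41
unicode_a = "A".encode("utf-16-be")  # two octets: 00 41

print(ascii_a.hex(), unicode_a.hex())

# Same code numbers, different octet streams: not byte-compatible.
print(ascii_a == unicode_a)

# The first 256 Unicode codes coincide with the ISO 8859-1 code table.
latin1 = bytes(range(256)).decode("iso8859_1")
same_codes = [ord(char) for char in latin1] == list(range(256))
print(same_codes)
```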

The Unicode standard deserves a separate book to describe it fully (in fact, its official specification is available in book form from the Unicode Consortium).  Its many blocks and zones cover all literal and syllabic alphabets that are now in use, alphabets of many dead languages, lots of special symbols and combined characters (such as letters with all imaginable diacritical marks, circled digits, and so on).

Also, Unicode provides space for more than 20 thousand unified ideographs used in Far East languages.  Contrary to other alphabets, ideographic systems were treated on a language-independent basis.  This means that an ideograph that has similar meanings and appearance across the Far East languages is represented by a single code despite the fact that it corresponds to quite different words in each of the languages and that most such ideographs have country-specific glyph variants.

The resulting ideographic system implemented in Unicode is often abbreviated CJK (Chinese, Japanese, Korean) after the names of the major languages covered by this system.  CJK unification reduced the set of ideographs to be encoded to a manageable (and codeable) number, but the undesirable side effect is that it is impossible to create a single Unicode font suitable for everyone; a Chinese text should be displayed using slightly different visual shapes of ideographs than a Japanese text even if they use the same Unicode-encoded ideographs.

The work on Unicode is far from complete, as about 34 percent of the total coding space remains unassigned.  Working groups in both the Unicode Consortium and ISO are selecting and codifying the most deserving candidates to colonize Unicode's still unclaimed wastelands.  A good sign is that the acceptance of Unicode throughout the computer industry is taking off; for example, Unicode is used for internal character coding in the Java programming language and for font layout in the Windows 95 and Windows NT operating systems.


ISO 10646


Although Unicode is still not widely used, ISO published in 1993 a new, 32-bit encoding standard named ISO/IEC 10646-1, or Universal Multiple-Octet Coded Character Set (abbreviated UCS).  Just as 7-bit ASCII does, though, this standard leaves the most significant bit in the most significant octet unused, which makes it essentially a 31-bit encoding.

Still, the code space of ISO 10646 spans the tremendous number of 2^31 = 2147483648 code positions, which is much, much more than could be used by all languages and writing systems that ever existed on Earth.  What, then, is the rationale behind such a huge "Unicode of Unicodes"?

The main reason for developing a 4-octet encoding standard is that Unicode actually cannot accommodate all the characters for which it would be useful to provide encoding.  Although a significant share of Unicode codes are still vacant, the proposals for new character and ideograph groups that are now under consideration require in total several times more code positions than are available in 16-bit Unicode.

Extending Unicode thus seems inevitable, and it makes little sense to extend it by one octet because computers will have trouble dealing with 3-octet (24-bit) sequences; 32-bit encoding, on the other hand, is particularly convenient for modern computers, most of which process information in 32-bit chunks.

Just as Unicode extends ISO 8859-1, the new ISO 10646 is a proper extension of Unicode.  In terms of ISO 10646, a chunk of 256 sequential code positions is called a row, 256 rows constitute a plane, and 256 planes make up a group.  The whole code space is thus divided into 128 groups.  In such terms, Unicode is simply plane 00 of group 00, the special plane that the ISO 10646 standard calls the Basic Multilingual Plane (BMP).  For example, the Latin letter A (Unicode 0041) is fully coded in ISO 10646 as 00000041.  As of now, the ISO 10646 BMP is absolutely identical to Unicode, and it is unlikely that these two standards will ever diverge.
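The group, plane, and row structure corresponds to the four octets of the canonical form, which a short sketch can take apart.  The function names `ucs4_parts` and `in_bmp` are hypothetical; "cell" is used here simply as a label for the position within a row.

```python
def ucs4_parts(code):
    """Split a 31-bit code into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("code is outside the 31-bit code space")
    group = (code >> 24) & 0x7F  # 128 groups: the top bit is unused
    plane = (code >> 16) & 0xFF  # 256 planes per group
    row = (code >> 8) & 0xFF     # 256 rows per plane
    cell = code & 0xFF           # 256 positions per row
    return group, plane, row, cell

def in_bmp(code):
    """True if the code lies in plane 00 of group 00."""
    group, plane, _, _ = ucs4_parts(code)
    return group == 0 and plane == 0

print(format(ord("A"), "08X"))  # 00000041
print(ucs4_parts(ord("A")))     # (0, 0, 0, 65)
print(in_bmp(0x0041), in_bmp(0x00010000))
```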

ISO 10646 specifies a number of intermediate formats that do not require using the codes in the canonical form of 4 octets per character.  For example, the UCS-2 (Universal Character Set, 2-octet format) is indistinguishable from Unicode as it uses 16-bit codes from the BMP.  The UTF-8 format (UCS Transformation Format, 8 bits) can be used to incorporate, with a sort of code switching technique, 32-bit codes into a stream consisting of mostly 7-bit ASCII codes.  Finally, the UTF-16 method was developed to access more than a million 4-octet codes from within a Unicode/BMP 2-octet data stream without making it incompatible with current Unicode implementations.
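The two transformation formats can be tried out with Python's built-in codecs.  This is a sketch under modern assumptions: the code 0001D11E used below was assigned to a character long after this text was written, and the helper `to_surrogates` is a hypothetical name for the arithmetic that UTF-16 uses to reach codes beyond the BMP.

```python
def to_surrogates(code):
    """Split a code above FFFF into the pair of 16-bit units used by UTF-16."""
    offset = code - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low

# UTF-8: ASCII stays as single octets; other codes become multi-octet sequences.
print("A".encode("utf-8").hex())       # 41
print("\u00e9".encode("utf-8").hex())  # c3a9

# UTF-16: a code outside the BMP becomes two 16-bit units.
code = 0x1D11E
high, low = to_surrogates(code)
print(f"{high:04X} {low:04X}")         # D834 DD1E
print(chr(code).encode("utf-16-be").hex())
```

Two sets of 1024 values (10 bits each) give 1024 x 1024 = 1048576 reachable codes, which is the "more than a million" mentioned above.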

Most probably, ISO 10646 will rarely be used in its canonical 4-octet form.  For most texts and text-processing applications, wasting 32 bits per character is beyond the acceptable level of redundancy.  However, ISO 10646 is an important standard in that it establishes a single authority on the vast lands lying beyond Unicode, thus preventing the problem of incompatible multioctet encodings even before this problem could possibly emerge.
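The redundancy is easy to quantify for an ASCII-only text.  In this sketch, Python's `utf-32-be` codec is assumed to stand in for the canonical 4-octet form and `utf-16-be` for the 2-octet form.

```python
sample = "Character Encoding Standards"

# Octets needed to store the same text in each form.
sizes = {
    codec: len(sample.encode(codec))
    for codec in ("utf-8", "utf-16-be", "utf-32-be")
}

for codec, size in sizes.items():
    print(f"{codec}: {size} octets for {len(sample)} characters")
```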


Created: Jun. 15, 1997
Revised: Jun. 16, 1997