HTML Unleashed. Internationalizing HTML: Language-Specific Presentation Markup | WebReference


HTML Unleashed: Internationalizing HTML

Language-Specific Presentation Markup


In a multilanguage environment, a need may arise to specify in HTML some aspects of text presentation, such as the writing direction (left to right or right to left), punctuation peculiarities, and so on.  These aspects usually can be derived from the language of the text (see preceding section), but sometimes one may need to specify this information without specifying the language or to override the language default values.  Also, some presentation aspects (such as quotation marks) require additional markup even if a language is specified.

RFC 2070 introduces and HTML 4.0 adopts a whole bouquet of new HTML elements, attributes, and entities for this sort of presentation markup.  These new features are summarized in the following sections.


Writing Direction


While most Western languages are written from left to right, languages such as Arabic and Hebrew are written from right to left.  When such text is intermingled with text of the opposite direction (resulting in bidirectional, or BIDI, text), special markup may be necessary to resolve ambiguity.

The Unicode standard has a number of direction-related provisions.  Each Unicode character is assigned a bidirectional category that may take a number of different values, such as left-to-right, right-to-left, number separator, or neutral (for example, white space).  Some characters (such as parentheses) are marked as mirrored depending on the text direction (in right-to-left text, an opening parenthesis should take the appearance of a closing one and vice versa).

To support this behavior, RFC 2070 introduces directional markup tools of three types.  The first type consists of the left-to-right and right-to-left marks, which behave exactly like zero-width spaces that have the corresponding direction properties.  These marks are taken directly from the Unicode inventory, so in HTML they are implemented as entities expanding into the corresponding Unicode characters:

<!ENTITY lrm  CDATA "&#8206;" -- left-to-right mark -->
<!ENTITY rlm  CDATA "&#8207;" -- right-to-left mark -->

Direction marks may be used when, for example, a double quote (which doesn't have a direction of its own, but is not a mirrored character either) sits between a Latin and a Hebrew character; in this situation, the actual place of the quote depends on whether it is assumed to belong to the left-to-right or the right-to-left text stream.  By placing an invisible direction mark (&lrm; or &rlm;) on one side of the quote, you can ensure that the quote is surrounded by characters of the same directionality, thereby resolving the ambiguity.
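
As a rough illustration of the quote case just described, the fragment below places a plain double quote between a Latin letter and a Hebrew letter in three ways.  The letters chosen (A and the Hebrew letter alef, written as the numeric reference &#1488;) are only an assumed example, and the actual rendering depends on the bidirectional support of the user agent.

```html
<!-- No mark: the quote sits between a left-to-right and a right-to-left
     character, so its placement is left to the surrounding context -->
<P>A"&#1488;</P>

<!-- &lrm; after the quote: both neighbors of the quote are now
     left-to-right, so the quote stays with the Latin text -->
<P>A"&lrm;&#1488;</P>

<!-- &rlm; before the quote: both neighbors of the quote are now
     right-to-left, so the quote stays with the Hebrew text -->
<P>A&rlm;"&#1488;</P>
```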

The second type of direction markup is represented by the new DIR attribute that, like LANG, can be used with nearly all HTML tags to indicate the writing direction of the text in the element's contents.  Sometimes you may need to indicate the basic writing direction of a piece of text; also, explicit direction markup is critical when there are two or more levels of nested contra-directional text (for an example, refer to RFC 2070).

The two possible values for the DIR attribute are the strings rtl (right-to-left) and ltr (left-to-right).  As with CSS properties, when no existing element delimits the text in question, you can apply the DIR attribute through the SPAN element, which serves as a neutral container.  If the DIR attribute is omitted, the element inherits the writing direction of its parent element.  The entire HTML document's default direction is left to right.
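
A minimal sketch of both uses follows: DIR on a neutral SPAN container, and DIR on a block element whose contents inherit it.  The sample sentence and the Hebrew word (spelled out with numeric character references) are assumptions made for illustration only.

```html
<!-- The paragraph is left to right by default; the SPAN marks only the
     Hebrew phrase, its exclamation mark included, as right to left -->
<P>The sign read <SPAN LANG="he" DIR="rtl">&#1513;&#1500;&#1493;&#1501;!</SPAN> in large letters.</P>

<!-- DIR set on a block element is inherited by the elements inside it -->
<DIV DIR="rtl">
<P>&#1513;&#1500;&#1493;&#1501;</P>
</DIV>
```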

For brevity, definitions of the DIR and LANG attributes are packed into one parameter entity in the HTML 4.0 DTD:

<!ENTITY % i18n
"lang  NAME       #IMPLIED  -- RFC 1766 language value --
dir    (ltr|rtl)  #IMPLIED  -- default directionality --">

Later, the %i18n; entity is added to the ATTLIST declarations for the majority of HTML elements.

Finally, the third type of direction markup is represented by the new phrase-level BDO element (BDO stands for BiDirectional Override).  It is used when a mix of left-to-right and right-to-left characters should be displayed in a single direction, overriding the intrinsic directional properties of the characters.  For the BDO element, DIR is the only obligatory attribute.
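
The effect of BDO can be sketched with three Hebrew letters (alef, bet, gimel, given here as numeric character references purely as an assumed example):

```html
<!-- Normal rendering: the intrinsic direction of the letters applies,
     so they run from right to left -->
<P>&#1488;&#1489;&#1490;</P>

<!-- Override: the same letters are laid out from left to right,
     in the order in which they appear in the source -->
<P><BDO DIR="ltr">&#1488;&#1489;&#1490;</BDO></P>
```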


Cursive Joining Behavior


In some writing systems (most notably Arabic), a letter's glyph may be different depending on the context, that is, on whether the letter is preceded or followed by some other letters.  Arabic letters are modeled after handwritten cursive prototypes, so a letter in the middle of a word is drawn joined to its neighbors and therefore may look quite different than it does when it is isolated.

As a rule, software capable of displaying Arabic handles these differences automatically.  But sometimes it is necessary to control the joining behavior explicitly, for example, to show a standalone letter in one of its joined forms.  For this, Unicode provides two special characters, both invisible and of zero width: the first forces joining of adjacent characters where normally no joining would occur, and the second prevents joining that would normally take place.  HTML 4.0 provides means to access these characters in HTML via the &zwj; and &zwnj; mnemonic character entities:

<!ENTITY zwnj CDATA "&#8204;" -- zero width non-joiner -->
<!ENTITY zwj  CDATA "&#8205;" -- zero width joiner     -->
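
The following sketch shows how these entities might be used.  The Arabic letters heh (&#1607;) and beh (&#1576;) are assumed examples, and the result depends on the user agent supporting Arabic shaping at all.

```html
<!-- Isolated form: no joiners -->
<P DIR="rtl">&#1607;</P>

<!-- Joiner after the letter: drawn as if another letter followed (initial form) -->
<P DIR="rtl">&#1607;&zwj;</P>

<!-- Joiners on both sides: drawn as if inside a word (medial form) -->
<P DIR="rtl">&zwj;&#1607;&zwj;</P>

<!-- Joiner before the letter: drawn as if another letter preceded (final form) -->
<P DIR="rtl">&zwj;&#1607;</P>

<!-- Non-joiner between two letters that would normally join -->
<P DIR="rtl">&#1576;&zwnj;&#1576;</P>
```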

Quotation Marks


A number of different styles exist to render quotation marks around short, in-text quotations.  Although the English language always uses quotes “like this,” French has « comme ça », and German prefers „wie hier“.  Moreover, nested quotations sometimes use different styles; for example, Russian tradition uses French quotes (without separating spaces) on the upper level and German quotes for quotations within quotations.  Finally, it is desirable to be able to render the same text with rich quotes (“”) in a graphics environment but with plain double quotes of 7-bit ASCII ("") in text-mode browsers.

To account for these differences, HTML 4.0 offers the new phrase-level Q element whose content is surrounded by a pair of quotation marks rendered in accordance with the language of the text, the level of nesting, and the display capabilities available.  For example:

<P LANG="en">The English language always uses 
quotes <Q>like this</Q>,
French has <Q LANG="fr">comme &ccedil;a</Q>,
and German prefers <Q LANG="de">wie hier</Q>.</P>

Unfortunately, this solution is not backwards compatible; most existing software will just ignore Q tags without displaying even the plain ASCII quotes, which can often damage the meaning of the text.  Thus, practical use of Q elements is not encouraged until the majority of user agent software provides support for the feature.
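
Until then, one possible fallback (a suggestion made here, not something RFC 2070 or HTML 4.0 prescribes) is to write the quotation marks explicitly.  The sketch below uses the &laquo; and &raquo; entities from Latin-1 and the &ldquo;, &rdquo;, and &bdquo; entities defined in HTML 4.0; older browsers may not recognize the latter three, in which case plain ASCII quotes remain the safest choice.

```html
<P LANG="en">The English language always uses
quotes &ldquo;like this&rdquo;,
French has &laquo;&nbsp;comme &ccedil;a&nbsp;&raquo;,
and German prefers &bdquo;wie hier&ldquo;.</P>

<!-- Plain 7-bit ASCII quotes, readable in any browser -->
<P LANG="en">The English language always uses quotes "like this".</P>
```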


Alignment and Hyphenation


Traditions of using text justification modes in other languages may be quite different from those of English.  That is why RFC 2070 introduces the optional ALIGN attribute that may be used with most block-level elements (namely P, HR, H1 to H6, OL, UL, DIV, MENU, LI, BLOCKQUOTE, and ADDRESS) with the values of left, right, center, and justify.  RFC 2070 suggests that the default ALIGN value for texts with left-to-right writing direction should be left, and for right-to-left texts, right.

This is a significant improvement over HTML 3.2, where the list of elements supporting this attribute is shorter (only DIV, H1 to H6, HR, TD, and P) and the value "justify" is not allowed.  Judging from the DTD, HTML 4.0 takes a halfway approach: it adopts the justify option but leaves the list of elements accepting the ALIGN attribute the same as in HTML 3.2.
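
A short sketch of the ALIGN attribute on P, one of the elements that accepts it in both RFC 2070 and HTML 4.0, is given below.  The sample text is assumed, and the second paragraph sets ALIGN explicitly rather than relying on the default that RFC 2070 suggests for right-to-left text.

```html
<!-- Justified text: the value that HTML 3.2 did not allow -->
<P ALIGN="justify">This paragraph is set flush against both the left
and the right margin, with the spacing between words adjusted as needed.</P>

<!-- Right-to-left paragraph aligned to the right margin -->
<P DIR="rtl" ALIGN="right">&#1513;&#1500;&#1493;&#1501;</P>
```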

As for hyphenation, user agents are supposed to apply language-dependent rules to break words if this is necessary for proper display.  In complex or critical cases, RFC 2070 suggests that HTML authors use the mnemonic entity &shy; that invokes the SOFT HYPHEN character present in Unicode as well as all of the ISO 8859 family and other character sets.

This invisible character marks the point where a word break can occur; if the word is indeed broken, the character is displayed as an ordinary hyphen (-) character.  Unfortunately, common browsers do not implement this behavior; what's worse, both Netscape Navigator and Microsoft Internet Explorer always display a - in place of a soft hyphen, which makes the character unusable in practice.
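
For reference, the intended use looks roughly like the sketch below, where the break points in the sample word are an assumed example.  A conforming user agent would show a hyphen only at the point where it actually breaks the word.

```html
<!-- Each &shy; marks a permitted break point inside the word -->
<P>Support for inter&shy;nation&shy;al&shy;iza&shy;tion varies
from one browser to another.</P>
```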

For better hyphenation control, a new HYPH element was proposed that can handle complex cases in which breaking a word is accompanied by a change in its spelling (for example, the German word backen becomes bak-ken when hyphenated).  However, the HYPH element was not included in either RFC 2070 or HTML 4.0.


Created: Jun. 15, 1997
Revised: Jun. 16, 1997