Effective XML: 50 Specific Ways to Improve Your XML | WebReference

This book excerpt is from Elliotte Rusty Harold's "Effective XML: 50 Specific Ways to Improve Your XML," ISBN 0321150406. All rights reserved. Item 3, Stay with XML 1.0, is posted with permission of Addison-Wesley.

Item 3, Stay with XML 1.0

Everything you need to know about XML 1.1 can be summed up in two rules.

    1. Don't use it.
    2. (For experts only) If you speak Mongolian, Yi, Cambodian, Amharic, Dhivehi, Burmese, or a very few other languages and you want to write your markup (not your text but your markup) in these languages, you can set the version attribute of the XML declaration to 1.1. Otherwise, refer to rule 1.
XML 1.1 does several things, one of them marginally useful to a few developers, the rest actively harmful.

  • It expands the set of characters allowed as name characters.
  • The C0 control characters (except for NUL) such as form feed, vertical tab, BEL, and DC1 through DC4 are now allowed in XML text provided they are escaped as character references.
  • The C1 control characters (except for NEL) must now be escaped as character references.
  • NEL can be used in XML documents but is resolved to a line feed on parsing.
  • Parsers may (but do not have to) tell client applications that Unicode data was not normalized.
  • Namespace prefixes can be undeclared.
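
The NEL rule in the list above can be sketched in a few lines of Python. This is only an illustration, not a parser: the function name normalize_line_ends is hypothetical, and the full set of line-end sequences (CR LF, CR NEL, NEL, U+2028, and a lone CR) is taken from the end-of-line handling section of the XML 1.1 specification rather than from this excerpt.

```python
import re

# Line-end sequences that XML 1.1 end-of-line handling turns into one line feed.
# The two-character sequences come first so they win over a lone carriage return.
_LINE_END = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_ends(text: str) -> str:
    """Replace each XML 1.1 line-end sequence with a single U+000A (hypothetical helper)."""
    return _LINE_END.sub("\n", text)


# NEL (U+0085) and the other line ends all collapse to plain line feeds.
sample = "first\x85second\r\nthird\r\x85fourth\u2028fifth\rsixth"
print(normalize_line_ends(sample).split("\n"))
```

An XML 1.0 parser applies only the CR LF and lone CR cases, which is why a NEL in a 1.0 document reaches the application unchanged.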
Let's look at these changes in more detail.

New Characters in XML Names

XML 1.1 expands the set of characters allowed in XML names (that is, element names, attribute names, entity names, ID-type attribute values, and so forth) to allow characters that were not defined in Unicode 2.0, the version that was extant when XML 1.0 was first defined. Unicode 2.0 is fully adequate to cover the needs of markup in English, French, German, Russian, Chinese, Japanese, Spanish, Danish, Dutch, Arabic, Turkish, Hebrew, Farsi, Thai, Hindi, and most other languages you’re likely to be familiar with as well as several thousand you aren’t. However, Unicode 2.0 did miss a few important living languages including Mongolian, Yi, Cambodian, Amharic, Dhivehi, and Burmese, so if you want to write your markup in these languages, XML 1.1 is worthwhile.

However, note that this is relevant only if we’re talking about markup, particularly element and attribute names. It is not necessary to use XML 1.1 to write XML data, particularly element content and attribute values, in these languages. For example, here’s the beginning of an Amharic translation of the Book of Matthew written in XML 1.0.

Here the element and attribute names are in English although the content and attribute values are in Amharic. On the other hand, if we were to write the element and attribute names in Amharic, we would need to use XML 1.1.
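
The point that XML 1.0 already handles Amharic data can be checked with any XML 1.0 parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (book, language, title) and the short Ethiopic-script string are placeholders made up for this illustration; they are not the book's sample document.

```python
import xml.etree.ElementTree as ET

# Short Ethiopic-script placeholder string (assumption: illustrative content only).
AMHARIC = "\u12a0\u121b\u122d\u129b"

# English element and attribute names, Amharic attribute value and element
# content, all in a document that declares itself as XML 1.0.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<book language="{AMHARIC}"><title>{AMHARIC}</title></book>'
)

# Parse from bytes so the encoding declaration is honored.
root = ET.fromstring(doc.encode("utf-8"))

print(root.tag)
print(root.get("language") == AMHARIC)
print(root.findtext("title") == AMHARIC)
```

The parse succeeds because the Ethiopic characters appear only in content and attribute values, where XML 1.0 allows any legal Unicode character.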

This is plausible. A native Amharic speaker might well want to write markup like this. However, the loosening of XML’s name character rules has effects far beyond the few extra languages it is intended to enable. Whereas XML 1.0 is conservative (everything not permitted is forbidden), XML 1.1 is liberal (everything not forbidden is permitted). XML 1.0 lists the characters you can use in names. XML 1.1 lists the characters you can’t use in names. Characters XML 1.1 allows in names include:

  • Symbols like the copyright sign (©)
  • Mathematical operators such as ±
  • Superscript 7 (⁷)
  • The musical symbol for a six-string fretboard
  • The zero-width space
  • Private-use characters
  • Several hundred thousand characters that aren’t even defined in Unicode and probably never will be

XML 1.1’s lax name character rules have the potential to make documents much more opaque and obfuscated.
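
To see why such characters make poor name material, it can help to look at how Unicode itself classifies them. The sketch below uses Python's standard unicodedata module to print the general category of a few characters of the kinds listed above. It does not test any XML name rule; the sample table and the is_letter helper are hypothetical names chosen for this illustration.

```python
import unicodedata

# A few characters of the kinds mentioned in the list above (illustrative selection).
samples = {
    "copyright sign": "\u00a9",
    "plus-minus sign": "\u00b1",
    "superscript seven": "\u2077",
    "zero width space": "\u200b",
    "private-use character": "\ue000",
    "unassigned code point": "\U00050000",
}


def is_letter(ch: str) -> bool:
    """True if Unicode classifies ch as a letter (general category L*)."""
    return unicodedata.category(ch).startswith("L")


for label, ch in samples.items():
    # The general category is a two-letter code such as So (symbol, other).
    print(f"{label}: U+{ord(ch):04X} {unicodedata.category(ch)} letter={is_letter(ch)}")
```

None of these characters is classified as a letter, and two of them (the zero-width space and the unassigned code point) have no visible glyph at all, which is the kind of opacity the paragraph above warns about.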

Created: March 27, 2003
Revised: October 25, 2003

URL: http://webreference.com/programming/xml/1