Preparing for XML. XML Scenarios | WebReference

  XML Scenarios

The recent XML buzz in Internet media can easily perplex an inexperienced user.  One of the biggest myths about the new language is that existing HTML can be turned into meaningful XML by an easy, automatic conversion.  To be sure, such a conversion is possible, and the result will pass a check as well-formed or even valid XML.  But what is such a conversion worth?

XML is not about syntactic innovations such as quotes around attribute values and trailing slashes in empty tags.  XML's goal is to comprehensively mark up all details of a given unit of information, without mixing data belonging to different units or different aspects of one unit.  From this viewpoint, a tag-by-tag conversion of "real world" HTML, with its hopeless medley of logical and visual elements, to XML doesn't make any sense at all.  On most sites, HTML bears little relation to the logical structure of pages and, properly speaking, little to their presentation either: it does not describe the formatting of the pages, but only emulates it by using tables, invisible spacers and similar hacks.

So, we have to forget about the XML promise for now, at least until we take the trouble of re-formulating all our data consistently, in its presentation and (more importantly) its content aspects.  This is well known to those who take XML seriously and are aware of what it can offer, and a growing number of new document collections and software tools are being built from the ground up using XML-inspired approaches.  However, the huge legacy of existing HTML documents needs special treatment.

As you may have guessed, modular HTML is an essential transition stage on the way to XML.  Just as you can update all instances of one module throughout the site by a global search-and-replace, you can use the same technique to replace your HTML modules with logical XML tags.  For example, the XML for the above heading module could look like this:


This expression is not only correct XML, but most importantly, it perfectly fits into the ideology of generalized markup, as all traces of presentation machinery are eliminated and what remains is a purely logical declaration stating what this element is, not how it is formatted.  (Admittedly, the name of the XML element, FRAMED-HEADING, was coined in acknowledgement of its intended visual presentation, but this is done only to preserve some consistency in the source markup, while the actual formatting of this element may, with time, deviate quite far from the original.)
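The search-and-replace step described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the HTML markup of the heading module (a one-cell table with bold text) and the sample heading "RECENT NEWS" are hypothetical, since the actual module is defined elsewhere; only the FRAMED-HEADING element name comes from this article.  The naive `str.title()` call used for the switch to initial caps would also mangle acronyms, so a real conversion would need more care.

```python
import re

# Hypothetical HTML "heading module": a one-cell table emulating a framed
# heading. The exact markup is an assumption made for illustration only.
HTML_MODULE = re.compile(
    r'<table border="1"><tr><td><b>(?P<text>[^<]*)</b></td></tr></table>',
    re.IGNORECASE,
)


def to_initial_caps(text: str) -> str:
    """Turn an all-caps heading into initial caps (naive: breaks acronyms)."""
    return text.title()


def convert_headings(html: str) -> str:
    """Replace every instance of the HTML module with a logical XML element."""
    return HTML_MODULE.sub(
        lambda m: "<FRAMED-HEADING>%s</FRAMED-HEADING>"
        % to_initial_caps(m.group("text").strip()),
        html,
    )


page = (
    "<p>Intro</p>\n"
    '<table border="1"><tr><td><b>RECENT NEWS</b></td></tr></table>\n'
    "<p>Body</p>"
)
print(convert_headings(page))
```

Because the module is spelled identically everywhere it occurs, a single pattern is enough to catch all instances; markup that does not match the module exactly is left untouched.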

Note in particular that the text part of the heading is kept unchanged in the conversion except for one difference: in HTML, the heading was in all caps, while in XML it is in the conventional initial caps form.  The all-caps spelling in the HTML is dictated by purely visual considerations, so it was deemed irrelevant to the purely logical XML markup.  The stylesheet to be attached to this document will have to take care, among other things, of capitalizing the content of all FRAMED-HEADING elements for display.
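To make the division of labor concrete, here is a minimal Python sketch that stands in for such a stylesheet rule: the source XML keeps the heading in initial caps, and capitalization is applied only when the document is prepared for display.  The PAGE and P element names and the sample text are hypothetical; a real site would more likely express this as a rule in its stylesheet language rather than in code.

```python
import xml.etree.ElementTree as ET


def render_for_display(xml_source: str) -> str:
    """Return the document with FRAMED-HEADING text upper-cased for display.

    This stands in for a stylesheet rule; the source string is not modified.
    """
    root = ET.fromstring(xml_source)
    for heading in root.iter("FRAMED-HEADING"):
        if heading.text:
            heading.text = heading.text.upper()
    return ET.tostring(root, encoding="unicode")


# PAGE and P are hypothetical element names used only for this example.
source = "<PAGE><FRAMED-HEADING>Recent News</FRAMED-HEADING><P>Body text</P></PAGE>"
print(render_for_display(source))
```

If the visual design later changes, only this display step needs to change; the logical markup and its text stay as they are.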


Created: Sept. 17, 1998
Revised: Sept. 28, 1998