HTML Unleashed. The Emergence of XML: XML DTDs and Valid XML Documents | WebReference


HTML Unleashed: The Emergence of XML

XML DTDs and Valid XML Documents


Although in many cases well-formed XML documents are sufficient for practical purposes, designing a DTD for your document has a number of advantages:

  • First and foremost, a DTD allows an XML parser to validate your document (that is why such documents are called valid).  When validating, the parser checks for misspelled tags or attributes, for errors in the types of attribute values and in the content models of elements, and so on.  For HTML, similar validation services exist that will check your file against one of the existing HTML DTDs.

  • For a human reader, a DTD is a convenient way to quickly learn the structure of a particular type of document.  Compared to SGML, the simplified DTD syntax of XML is straightforward and unambiguous.

  • With a DTD, you can define not only elements and their attributes, but also entities.  (See "Entity Declarations," later in this chapter.) Like macros in word processors or #define preprocessor directives in C, entities can be used to abbreviate text strings and markup in an obvious and easy-to-modify manner.  You can also use external entities to refer to other XML documents, DTDs, or binary data located in separate files.

Accessing the DTD


Let's examine an example of a valid XML document, namely a play by Shakespeare (The Tempest) marked up by Jon Bosak, one of the authors of XML.  Besides the XML document and its DTD, the package includes a DSSSL style sheet that contains formatting instructions for each element and the PostScript output of a DSSSL processor that formatted the play.

Here's the very beginning of the XML document play.xml:

<?XML version="1.0"?>
<!DOCTYPE PLAY PUBLIC
       "-//Free Text Project//DTD Play//EN">

The first line here is an XML declaration, a special XML-specific instruction that would be ignored by an SGML parser.  Here, the XML declaration states the version of the XML standard that the document conforms to.

Next comes the DOCTYPE statement that, like its namesake in SGML, identifies the DTD against which the document is to be parsed.  In XML, a DTD may consist of two parts: the internal part is contained in the document file itself, while the external part is referenced by its URL or public identifier.  In case of conflict, the internal part takes precedence over the external one.

In our example, only the external part of the DTD is present, and it is referred to by the public identifier preceded by the keyword PUBLIC.  An XML parser is supposed to be able to retrieve the text of the DTD using its public identifier (that is, to translate the identifier into a URL or some other sort of physical address).  If the DTD you're using has not been assigned a well-known public identifier, you should provide a URL instead, with the SYSTEM keyword in place of PUBLIC.  For instance:

<!DOCTYPE PLAY SYSTEM "play.dtd">

Finally, to provide an internal part for a DTD, you must put it in brackets within the DOCTYPE declaration.  Such a declaration may also contain a SYSTEM or PUBLIC external reference, for example:

<!DOCTYPE PLAY SYSTEM "play.dtd" [
  <!-- your DTD goes here -->
]>

Element Declarations


The name right after the DOCTYPE keyword in the preceding statements is the name of the root element of your document type, the top level element that encloses all other elements.  In HTML, this element is named HTML, and in our Shakespearean example it is named PLAY.  Here's how the PLAY element is defined in play.dtd:

<!ELEMENT PLAY (title, fm, personae,
              scndescr, PLAYsubt, induct?,
              prologue?, act+, epilogue?)>

You can see that the content model for this element is quite simple and translates directly into plain language: "A PLAY is formed by its TITLE, followed by the front matter (FM), followed by the list of dramatis PERSONAE, and so on."  The question mark marks optional elements, and the plus sign marks elements that may occur one or more times.  Note that XML drops the SGML tag-minimization parameters, which would be useless here because XML doesn't permit tag omission anyway.
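To make the meaning of the occurrence indicators concrete, here is a minimal Python sketch that turns the PLAY content model into a regular expression over child element names. It is an illustration only, not how any particular XML parser validates: the helper names content_model_to_regex and matches_model are hypothetical, only flat comma-separated sequences are handled, and names are compared case-insensitively on the assumption that the mixed case in the DTD excerpt is not significant.

```python
import re


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a flat sequence content model into a regex over child names.

    Handles only comma-separated names with an optional ?, + or * suffix;
    nested groups and '|' choices are deliberately out of scope.
    """
    parts = []
    for particle in model.strip("() \n").split(","):
        particle = particle.strip()
        suffix = particle[-1] if particle[-1] in "?+*" else ""
        name = particle.rstrip("?+*").lower()  # assumption: case is ignored
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def matches_model(children: list[str], pattern: re.Pattern) -> bool:
    # Each child name is followed by one space so names cannot run together.
    sequence = "".join(name.lower() + " " for name in children)
    return pattern.fullmatch(sequence) is not None


PLAY_MODEL = """(title, fm, personae,
              scndescr, PLAYsubt, induct?,
              prologue?, act+, epilogue?)"""

play_pattern = content_model_to_regex(PLAY_MODEL)
required = ["title", "fm", "personae", "scndescr", "playsubt"]

# Optional parts may be left out, and act may repeat.
print(matches_model(required + ["act", "act", "epilogue"], play_pattern))
# At least one act is mandatory, so this sequence is rejected.
print(matches_model(required, play_pattern))
```

The first call prints True and the second prints False, mirroring what a validating parser would conclude about the children of a PLAY element.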

One more excerpt from play.dtd shows a hierarchical set of related tags used to mark up a character's speech:

<!ELEMENT speech   (speaker+,
                   (line | stagedir | subhead)+)>
<!ELEMENT speaker  (#PCDATA)>
<!ELEMENT line     (stagedir | #PCDATA)+>
<!ELEMENT stagedir (#PCDATA)>
<!ELEMENT subhead  (#PCDATA)>

Thus a SPEECH consists of one or more SPEAKER elements followed by one or more LINE, STAGEDIR (stage direction), or SUBHEAD elements in any order (the "|" sign means that any one of the connected particles may occur).  The #PCDATA keyword means "any character data without tags"; thus, the SPEAKER, STAGEDIR, and SUBHEAD elements may contain only text characters, while a LINE may have STAGEDIR elements intermingled with text.

Note that nothing in the definition of LINE (except the name) suggests that the element really contains a line of verse.  That is merely implied by the person who did the markup, and the element may be formatted as a line if an appropriate style sheet is used.  XML serves only as an intermediary between the author and the formatter; it is not intended to describe the nature of the data elements that are marked up with it.

Here's a SPEECH element exemplifying these DTD provisions:

<SPEECH>
<SPEAKER>PROSPERO</SPEAKER>
<LINE><STAGEDIR>Aside</STAGEDIR> The Duke of Milan</LINE>
<LINE>And his more braver daughter could control thee,</LINE>
<LINE>If now 'twere fit to do't. At the first sight</LINE>
<LINE>They have changed eyes. Delicate Ariel,</LINE>
<LINE>I'll set thee free for this.</LINE>
<LINE>A word, good sir;</LINE>
<LINE>I fear you have done yourself some wrong: a word.</LINE>
</SPEECH>
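The first LINE above is an instance of mixed content: a STAGEDIR child intermingled with character data. As a rough illustration of how an application might see such an element, the following sketch parses that one line with Python's standard xml.etree.ElementTree module. The choice of library is an assumption made for this example; ElementTree checks only well-formedness and does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# The first LINE of the speech, copied from the excerpt above.
line = ET.fromstring(
    "<LINE><STAGEDIR>Aside</STAGEDIR> The Duke of Milan</LINE>"
)
stagedir = line[0]

# No character data precedes the STAGEDIR child, so LINE has no leading text.
print(line.text)
# The stage direction is the text of the child element.
print(stagedir.text)
# The verse itself follows the child and is stored as its "tail".
print(repr(stagedir.tail))
# itertext() walks the text and the child content in document order.
print("".join(line.itertext()))
```

The text of the verse is attached to the STAGEDIR child as trailing data rather than to LINE itself, which is how this particular library models "STAGEDIR elements intermingled with text."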

Entity Declarations


Entities can be declared in a DTD as follows:

<!ENTITY me "Dmitry Kirsanov, 
                St.Petersburg, Russia">

In the document, such an entity can be used much like the mnemonic character entities of HTML:

This document was created by &me; 
                             on Apr 21, 1997
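The effect of such a reference can be checked with a short Python sketch. It places the entity declaration in the internal part of the DTD and lets the standard xml.etree.ElementTree module expand the reference; the root element name note is hypothetical and exists only to give the text a container.

```python
import xml.etree.ElementTree as ET

# Assumption: a one-element document whose internal DTD part declares "me".
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY me "Dmitry Kirsanov, St.Petersburg, Russia">
]>
<note>This document was created by &me; on Apr 21, 1997</note>"""

note = ET.fromstring(DOCUMENT)

# The parser substitutes the replacement text for &me; before the
# application ever sees the character data.
print(note.text)
```

Because the parser expands entities automatically, it is safest to run a sketch like this only on documents you trust.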

Another syntax is used to define entities that refer to external files or documents.  For example:

<!ENTITY mypage SYSTEM
   "">
<!ENTITY xml-logo SYSTEM
   "" NDATA gif>

In the second declaration, gif is the name of a notation (similar to a data type), which must be declared somewhere in the DTD along with information on where an XML processor can find helper software capable of handling data in this notation.

Now the &mypage; and &xml-logo; entities can be used in documents that use this DTD.  However, the XML specification does not prescribe the exact behavior of an XML application on encountering such an entity.  For example, the application may incorporate the entity into the text of the current document, or it may present it as a link that the user can activate.
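One way to see that the decision is left to the application is to watch what a parser reports for a non-text entity. The sketch below uses Python's xml.parsers.expat module, which merely passes the notation and entity declarations to callbacks and takes no further action. The system identifiers viewer.exe and logo.gif and the root element page are hypothetical placeholders, not values from the original DTD.

```python
import xml.parsers.expat

# Assumption: placeholder system identifiers stand in for the real locations.
DOCUMENT = """<!DOCTYPE page [
  <!NOTATION gif SYSTEM "viewer.exe">
  <!ENTITY xml-logo SYSTEM "logo.gif" NDATA gif>
]>
<page/>"""

notations = []
entities = []


def on_notation(name, base, system_id, public_id):
    # Called once for each <!NOTATION ...> declaration.
    notations.append((name, system_id))


def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    # Called once for each <!ENTITY ...> declaration; for an NDATA entity
    # the replacement value is None and the notation name is filled in.
    entities.append((name, value, system_id, notation))


parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.Parse(DOCUMENT, True)

print(notations)
print(entities)
```

The parser hands over the entity name, its location, and its notation; whether to fetch the file, display it, or offer it as a link is entirely up to the code that receives these callbacks.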


Created: Jun. 15, 1997
Revised: Jun. 16, 1997