Document Type Definition (DTD)
Document Type Definition (DTD) is the schema language defined by the XML 1.0 specification itself, inherited from SGML, for describing the structure and constraints of XML documents. While XML Schema (XSD) has largely superseded DTD for new applications, DTD remains important for understanding legacy systems and the XML vocabularies that still rely on it.
DTD vs XML Schema
Feature | DTD | XML Schema (XSD) |
---|---|---|
Syntax | Non-XML syntax | XML-based syntax |
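Data Types | Attribute types only (CDATA, ID, IDREF, NMTOKEN, enumerations); element text is untyped | Rich type system (numbers, dates, patterns, derived types) |
Namespace Support | Not namespace-aware (a prefix is treated as part of the name) | Full support |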
Extensibility | Limited | Excellent |
Learning Curve | Simpler | More complex |
DTD Declaration
DTDs can be declared internally within an XML document or externally in separate files:
Internal DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library [
<!ELEMENT library (book+)>
<!ELEMENT book (title, author, isbn)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book id ID #REQUIRED>
]>
<library>
<book id="book1">
<title>Learning XML</title>
<author>John Doe</author>
<isbn>978-1234567890</isbn>
</book>
</library>
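Even a non-validating parser reads the declarations in the internal subset. As a rough illustration, the sketch below uses Python's standard `xml.parsers.expat` module to list what the document above declares. The function name `list_declarations` is made up for this example, and expat only reports the declarations; it does not validate the document against them.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library [
<!ELEMENT library (book+)>
<!ELEMENT book (title, author, isbn)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book id ID #REQUIRED>
]>
<library>
<book id="book1">
<title>Learning XML</title>
<author>John Doe</author>
<isbn>978-1234567890</isbn>
</book>
</library>
"""


def list_declarations(xml_text):
    """Return (element names, attribute declarations) from the internal subset."""
    elements, attributes = [], []

    def on_element_decl(name, model):
        # Called once per <!ELEMENT ...>; the parsed content model is ignored here.
        elements.append(name)

    def on_attlist_decl(elname, attname, type_, default, required):
        # Called once per attribute declared in an <!ATTLIST ...>.
        attributes.append((elname, attname, type_, bool(required)))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.AttlistDeclHandler = on_attlist_decl
    parser.Parse(xml_text.encode("utf-8"), True)
    return elements, attributes


elements, attributes = list_declarations(DOCUMENT)
print(elements)
print(attributes)
```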
External DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library>
<!-- XML content -->
</library>
library.dtd:
<!ELEMENT library (book+)>
<!ELEMENT book (title, author, isbn)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book id ID #REQUIRED>
Element Declarations
Basic Element Syntax
<!ELEMENT element-name (content-model)>
Content Models
Empty Elements
<!ELEMENT br EMPTY>
Text Content Only
<!ELEMENT title (#PCDATA)>
Element Content Only
<!ELEMENT book (title, author, price)>
Mixed Content
<!ELEMENT description (#PCDATA | emphasis | strong)*>
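To see what mixed content looks like once parsed, here is a small sketch using Python's standard `xml.etree.ElementTree`. It assumes a `description` element like the one declared above; the helper `mixed_content_parts` is a hypothetical name for this example. ElementTree stores the text before the first child in `.text` and the text following each child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET

description = ET.fromstring(
    "<description>A classic novel exploring themes of "
    "<emphasis>racial injustice</emphasis> and "
    "<strong>moral growth</strong>.</description>"
)


def mixed_content_parts(element):
    """Flatten mixed content into (kind, text) pairs in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append((child.tag, child.text or ""))
        if child.tail:
            # Text that follows a child element belongs to that child's tail.
            parts.append(("text", child.tail))
    return parts


parts = mixed_content_parts(description)
for kind, text in parts:
    print(f"{kind:>8}: {text!r}")
```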
Content Operators
Sequence (,)
Child elements must appear in exactly the listed order:
<!ELEMENT book (title, author, isbn)>
Choice (|)
Exactly one of the listed elements must appear:
<!ELEMENT contact (email | phone | address)>
Occurrence Indicators
<!-- Exactly one (default) -->
<!ELEMENT book (title)>
<!-- Zero or one (optional) -->
<!ELEMENT book (title, subtitle?)>
<!-- Zero or more -->
<!ELEMENT book (title, author*)>
<!-- One or more -->
<!ELEMENT library (book+)>
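The operators above behave much like a regular expression over the names of an element's children. The following is a minimal sketch of that idea in Python, not how a real validator is implemented: it translates an element-content model into a regex and tests a list of child names against it. The function names are invented for this example, and the sketch does not handle `#PCDATA`, `EMPTY`, or `ANY`.

```python
import re

# Simplified XML name: ASCII letters, digits, '_', '.', '-' (an assumption).
NAME = r"[A-Za-z_][\w.\-]*"


def content_model_to_regex(model):
    """Translate a model such as '(title, author*)' into a compiled regex."""
    pattern = []
    for token in re.findall(rf"{NAME}|[(),|?*+]", model):
        if token == "(":
            pattern.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            pattern.append(token)  # same meaning in DTD and regex syntax
        else:
            # One child element; the trailing space separates names.
            pattern.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(pattern))


def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    text = "".join(f"{name} " for name in child_names)
    return content_model_to_regex(model).fullmatch(text) is not None


print(children_match("(title, subtitle?)", ["title"]))
print(children_match("(title, author, isbn)", ["author", "title", "isbn"]))
print(children_match("(email | phone | address)", ["phone"]))
```

The first and third calls succeed, while the second fails because the sequence operator fixes the order of the children.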
Attribute Declarations
Basic Attribute Syntax
<!ATTLIST element-name
attribute-name attribute-type default-value>
Attribute Types
CDATA
Any character data:
<!ATTLIST book title CDATA #IMPLIED>
ID and IDREF
Unique identifiers and references:
<!ATTLIST book id ID #REQUIRED>
<!ATTLIST reference bookref IDREF #REQUIRED>
NMTOKEN and NMTOKENS
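Name tokens (NMTOKEN is a single token made of name characters, with no whitespace; NMTOKENS is a whitespace-separated list of such tokens):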
<!ATTLIST element class NMTOKEN #IMPLIED>
<!ATTLIST element classes NMTOKENS #IMPLIED>
Enumerated Values
<!ATTLIST book format (hardcover | paperback | ebook) "paperback">
<!ATTLIST book language (en | es | fr | de) #REQUIRED>
Default Value Types
#REQUIRED
Attribute must be present:
<!ATTLIST book id ID #REQUIRED>
#IMPLIED
Attribute is optional:
<!ATTLIST book subtitle CDATA #IMPLIED>
#FIXED
Attribute has a fixed value:
<!ATTLIST book version CDATA #FIXED "1.0">
Default Value
Provides a default if not specified:
<!ATTLIST book language CDATA "en">
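Defaults are filled in by the parser, so application code sees them as if they had been written in the document. The sketch below shows this with Python's standard `xml.etree.ElementTree`, which reads attribute defaults from the internal subset without validating. The tiny document is made up for the example, and other parsers, or DTDs kept in external files, may behave differently.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE library [
<!ATTLIST book language CDATA "en">
<!ATTLIST book version CDATA #FIXED "1.0">
<!ATTLIST book subtitle CDATA #IMPLIED>
]>
<library>
<book/>
<book language="fr"/>
</library>
"""

root = ET.fromstring(DOCUMENT)
first, second = root.findall("book")

# Declared defaults (including #FIXED values) appear on every book element;
# the #IMPLIED attribute stays absent unless the document supplies it.
print(first.attrib)
print(second.attrib)
```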
Complete DTD Example
books.dtd:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Root element -->
<!ELEMENT library (metadata?, book+, author*)>
<!-- Metadata element -->
<!ELEMENT metadata (name, location, established?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ELEMENT established (#PCDATA)>
<!-- Book element -->
<!ELEMENT book (title, author-ref+, isbn, publisher?, price?, description?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author-ref EMPTY>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT description (#PCDATA | emphasis | strong)*>
<!-- Text formatting elements -->
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
<!-- Author element -->
<!ELEMENT author (name, biography?)>
<!ELEMENT biography (#PCDATA)>
<!-- Attributes -->
<!ATTLIST library
id ID #REQUIRED
version CDATA #FIXED "2.0">
<!ATTLIST book
id ID #REQUIRED
genre (fiction | non-fiction | science | history | biography) #REQUIRED
format (hardcover | paperback | ebook) "paperback"
available (yes | no) "yes">
<!ATTLIST author-ref
ref IDREF #REQUIRED>
<!ATTLIST author
id ID #REQUIRED
nationality CDATA #IMPLIED>
<!ATTLIST price
currency (USD | EUR | GBP | JPY) "USD">
Sample XML Document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library SYSTEM "books.dtd">
<library id="central-lib" version="2.0">
<metadata>
<name>Central Public Library</name>
<location>Downtown</location>
<established>1952</established>
</metadata>
<book id="book1" genre="fiction" format="hardcover">
<title>To Kill a Mockingbird</title>
<author-ref ref="author1"/>
<isbn>978-0-06-112008-4</isbn>
<publisher>Harper Perennial Modern Classics</publisher>
<price currency="USD">14.99</price>
<description>A classic novel exploring themes of
<emphasis>racial injustice</emphasis> and
<strong>moral growth</strong>.</description>
</book>
<book id="book2" genre="science">
<title>A Brief History of Time</title>
<author-ref ref="author2"/>
<isbn>978-0-553-10953-5</isbn>
<price currency="USD">16.99</price>
</book>
<author id="author1" nationality="American">
<name>Harper Lee</name>
<biography>American novelist known for To Kill a Mockingbird.</biography>
</author>
<author id="author2" nationality="British">
<name>Stephen Hawking</name>
<biography>Theoretical physicist and cosmologist.</biography>
</author>
</library>
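The `author-ref` elements only carry an IDREF, so an application has to follow the reference itself. As a sketch of one way to do that with Python's standard `xml.etree.ElementTree`, the code below works on a trimmed copy of the sample document. The function name `authors_by_book` is hypothetical, and the code assumes the document has already been validated, so every `ref` resolves.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document above (metadata and optional children omitted).
LIBRARY = """<library id="central-lib" version="2.0">
<book id="book1" genre="fiction" format="hardcover">
<title>To Kill a Mockingbird</title>
<author-ref ref="author1"/>
<isbn>978-0-06-112008-4</isbn>
</book>
<book id="book2" genre="science">
<title>A Brief History of Time</title>
<author-ref ref="author2"/>
<isbn>978-0-553-10953-5</isbn>
</book>
<author id="author1" nationality="American"><name>Harper Lee</name></author>
<author id="author2" nationality="British"><name>Stephen Hawking</name></author>
</library>
"""


def authors_by_book(root):
    """Map each book title to the names of the authors it references."""
    # Index authors by their ID attribute so each IDREF is a dictionary lookup.
    names = {a.get("id"): a.findtext("name") for a in root.findall("author")}
    return {
        book.findtext("title"): [
            names[ref.get("ref")] for ref in book.findall("author-ref")
        ]
        for book in root.findall("book")
    }


result = authors_by_book(ET.fromstring(LIBRARY))
print(result)
```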
Entity Declarations
Internal Entities
<!ENTITY copyright "Copyright 2023 Library Systems Inc.">
<!ENTITY contact-email "[email protected]">
Usage in XML:
<footer>&copyright; Contact: &contact-email;</footer>
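Internal entities are replaced by the parser before the application sees the text. A short sketch with Python's standard `xml.etree.ElementTree` illustrates this. The entity declarations sit in an internal subset here so the example is self-contained, and the address `contact@example.org` is a placeholder chosen for the example, not a value from this document.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE footer [
<!ENTITY copyright "Copyright 2023 Library Systems Inc.">
<!ENTITY contact-email "contact@example.org">
]>
<footer>&copyright; Contact: &contact-email;</footer>
"""

footer = ET.fromstring(DOCUMENT)

# The entity references have been expanded into ordinary character data.
print(footer.text)
```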
External Entities
<!ENTITY legal SYSTEM "legal-notice.txt">
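<!-- An unparsed (NDATA) entity must name a declared notation -->
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY logo SYSTEM "logo.gif" NDATA gif>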
Parameter Entities
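Used within the DTD itself for modularity. A reference inside a markup declaration, as in the second line below, is only allowed in the external subset (a separate .dtd file), not in an internal subset: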
<!ENTITY % text-elements "emphasis | strong | code">
<!ELEMENT description (#PCDATA | %text-elements;)*>
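Conceptually, a parameter entity reference is a text substitution performed while the DTD is read. The sketch below imitates that substitution with regular expressions in Python to show the declaration a parser effectively sees. It is a simplification under stated assumptions: only double-quoted internal parameter entities are handled, the function name `expand_parameter_entities` is invented, and the padding spaces a real parser adds around the replacement text are ignored.

```python
import re


def expand_parameter_entities(dtd_text, max_passes=10):
    """Textually expand %name; references from internal parameter entities."""
    declarations = dict(
        re.findall(r'<!ENTITY\s+%\s+([\w.\-]+)\s+"([^"]*)"\s*>', dtd_text)
    )

    def substitute(match):
        # Leave references to undeclared entities untouched.
        return declarations.get(match.group(1), match.group(0))

    # Repeat so entities that refer to other entities resolve as well;
    # the pass limit guards against self-referencing declarations.
    for _ in range(max_passes):
        expanded = re.sub(r"%([\w.\-]+);", substitute, dtd_text)
        if expanded == dtd_text:
            break
        dtd_text = expanded
    return dtd_text


DTD = (
    '<!ENTITY % text-elements "emphasis | strong | code">\n'
    "<!ELEMENT description (#PCDATA | %text-elements;)*>"
)
expanded = expand_parameter_entities(DTD)
print(expanded)
```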
Conditional Sections
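For switching blocks of declarations on or off in modular DTDs (conditional sections are allowed only in the external subset). Setting the parameter entity to "IGNORE" instead of "INCLUDE" makes the parser skip the block: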
<!ENTITY % include-images "INCLUDE">
<![%include-images;[
<!ELEMENT image EMPTY>
<!ATTLIST image
src CDATA #REQUIRED
alt CDATA #REQUIRED>
]]>
DTD Validation
Command Line Validation
# Using xmllint
xmllint --valid --noout document.xml
# Using xmlstarlet
xmlstarlet val -d books.dtd library.xml
Programmatic Validation
Java
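import java.io.File;
import javax.xml.parsers.*;
import org.w3c.dom.Document;
import org.xml.sax.*;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true); // validate against the DTD named in the DOCTYPE
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new ErrorHandler() {
    public void warning(SAXParseException e) {
        System.err.println("Warning: " + e.getMessage());
    }
    public void error(SAXParseException e) {
        System.err.println("Validation error: " + e.getMessage());
    }
    public void fatalError(SAXParseException e) throws SAXException {
        throw e; // document is not well-formed, so stop parsing
    }
});
Document doc = builder.parse(new File("library.xml"));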
Python (lxml)
from lxml import etree
# Load the DTD (binary mode lets the parser handle the declared encoding)
with open('books.dtd', 'rb') as dtd_file:
    dtd = etree.DTD(dtd_file)
# Parse the XML document
with open('library.xml', 'rb') as xml_file:
    xml_doc = etree.parse(xml_file)
# Validate and report each error with its line number
if dtd.validate(xml_doc):
    print("Document is valid")
else:
    print("Validation errors:")
    for error in dtd.error_log:
        print(f"Line {error.line}: {error.message}")
Best Practices
DTD Design
- Keep it simple: DTD syntax is limited, so avoid over-complicated content models
- Use meaningful names: Element and attribute names should be descriptive
- Document structure: Add comments to explain complex content models
- Modularize: Use parameter entities for reusable components
Performance Considerations
- External DTDs: Cache DTD files to avoid repeated network requests
- Entity references: Minimize complex entity structures
- Validation timing: Validate where documents enter the system (for example on import or in a build step) rather than on every parse
Migration Strategy
If moving from DTD to XML Schema:
<!-- DTD version -->
<!ELEMENT book (title, author)>
<!ATTLIST book id ID #REQUIRED>
<!-- XML Schema equivalent -->
<xs:element name="book">
<xs:complexType>
<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="author" type="xs:string"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
</xs:element>
Common Pitfalls
Mixed Content Issues
<!-- Problematic: mixed content cannot constrain the order or number of children -->
<!ELEMENT description (#PCDATA | emphasis | strong)*>
<!-- Better: structured container, with mixed content confined to paragraphs -->
<!ELEMENT description (paragraph+)>
<!ELEMENT paragraph (#PCDATA | emphasis | strong)*>
ID/IDREF Constraints
<!-- Remember: ID values must be unique across entire document -->
<!ATTLIST book id ID #REQUIRED>
<!ATTLIST author id ID #REQUIRED> <!-- Must not conflict with book IDs -->
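Because all ID attributes share one document-wide value space, a clash between a book and an author is a validity error. The check can be approximated without a validating parser; the sketch below uses Python's standard `xml.etree.ElementTree`. It assumes the ID attribute is called `id` and the IDREF attribute `ref`, since ElementTree does not know the declared attribute types, and `check_ids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return (duplicate ID values, IDREF values that match no ID)."""
    ids = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted(
        {
            el.get(idref_attr)
            for el in root.iter()
            if el.get(idref_attr) is not None and el.get(idref_attr) not in ids
        }
    )
    return duplicates, dangling


# Made-up document: "item1" is used by both a book and an author,
# and "a9" refers to an ID that does not exist.
doc = ET.fromstring(
    "<library>"
    '<book id="item1"><author-ref ref="a1"/></book>'
    '<book id="item2"><author-ref ref="a9"/></book>'
    '<author id="item1"/>'
    '<author id="a1"/>'
    "</library>"
)
duplicates, dangling = check_ids(doc)
print(duplicates, dangling)
```

A common way to avoid such clashes is to give each element type its own prefix, as the sample document does with `book1` and `author1`.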
Case Sensitivity
<!-- DTD is case-sensitive -->
<!ELEMENT Book (title)> <!-- Different from book -->
<!ELEMENT book (title)>
When to Use DTD
Still Appropriate For:
- Legacy system maintenance
- Simple document structures
- SGML compatibility requirements
- Quick prototyping
Consider XML Schema Instead For:
- New projects
- Complex data validation
- Namespace-heavy documents
- Rich data type requirements
Conclusion
While DTD is considered legacy technology, understanding it remains valuable for working with existing XML systems and comprehending XML's evolution. For new projects, XML Schema (XSD) typically provides better features and flexibility.
Next Steps
- Explore XML Schema (XSD) for modern validation
- Learn XML Namespaces for modular documents
- Study XML Validation for validation strategies