XML Validation and Document Definition

XML Validation

While XML may not be as prevalent as it was in the past, it continues to be a valuable component in certain areas. It serves as a means to structure and facilitate data exchange across applications and networks, in addition to configuring various software tools and systems.

To ensure reliable data transmission, it's crucial that XML documents adhere to specific standards and rules. XML validation, as part of this process, offers a potential safeguard against errors and helps maintain data integrity.

A Well-Formed XML Document

Before we dig deeper into validation, we need to distinguish a well-formed XML document from a valid one. A well-formed XML document obeys the basic XML syntax rules, such as having a single root element, properly nested and closed tags, and quotes around attribute values.

Consider the following simple example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<catalog>
  <product id="101" status="active">
    <name>XML Guide</name>
    <description>Mastering &amp; Understanding XML</description>
    <price currency="USD">15.99</price>
  </product>
</catalog>

This XML document is well-formed because:

  • It has a root element <catalog>.

  • Every start tag is closed by an end tag with exactly the same name. Tag names are case sensitive, so <catalog> and <Catalog> would be different elements.

  • All XML elements are properly nested (<product> starts and ends within <catalog>, while <name>, <description>, and <price> start and end within <product>).

  • XML attribute values are quoted (id="101" and status="active").

  • It uses the entity reference &amp; to represent the ampersand in the <description> element, because a bare & is not allowed in element content.

  • It includes a comment (<!-- This is a comment -->) that is not within a start-tag or end-tag.

You can read more about XML's syntax in the Additional Resources section.
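
As a rough illustration of these rules, a non-validating parser rejects any document that breaks them. The sketch below uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample strings are made up for this example and are not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One sample that follows the rules, and one sample per broken rule.
samples = {
    "good": '<catalog><product id="101"><name>XML Guide</name></product></catalog>',
    "bad nesting": "<catalog><product><name>XML Guide</product></name></catalog>",
    "case mismatch": "<catalog></Catalog>",
    "unquoted attribute": "<catalog><product id=101/></catalog>",
    "bare ampersand": "<description>Mastering & Understanding XML</description>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted. Note that this checks well-formedness only; it says nothing about whether the document is valid against a DTD or schema.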

Well-Formed vs Valid XML

While all valid XML documents are well-formed, the reverse isn't necessarily true. A valid XML document follows not only XML syntax rules but also adheres to a predefined Document Type Definition (DTD) or XML Schema Definition (XSD), which further restricts the structure and type of the XML document's data. Let's explore these in detail.

Document Type Definition (DTD)

A Document Type Definition (DTD) is essentially a rulebook that outlines what constitutes a valid XML document. It delineates the structure of an XML document by specifying which elements are permitted and how they relate to one another hierarchically. A DTD can either be internal, incorporated directly within an XML document, or external, maintained as a distinct document that the XML document references.

Consider this example of an XML document with an internal DTD:

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE library [
   <!ELEMENT library (book,author,publication_year)>
   <!ELEMENT book (#PCDATA)>
   <!ELEMENT author (#PCDATA)>
   <!ELEMENT publication_year (#PCDATA)>
]>

<library>
   <book>Neuromancer</book>
   <author>William Gibson</author>
   <publication_year>1984</publication_year>
</library>

Here's what each part of this XML document with an internal DTD means:

  • <?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>: This is the XML declaration. It informs the parser about the version of XML being used (1.0 in this case) and the character encoding used in the document (UTF-8 here). The standalone declaration set to yes indicates that the document does not depend on markup declarations stored outside the document itself, such as an external DTD.

  • <!DOCTYPE library [...]>: This is the DOCTYPE declaration. It informs the parser about the DTD associated with this XML document. Here, the root element is library.

  • <!ELEMENT library (book,author,publication_year)>: This declaration within the DTD body defines library as an element that contains book, author, and publication_year elements, in this specified order.

  • <!ELEMENT book (#PCDATA)>, <!ELEMENT author (#PCDATA)>, <!ELEMENT publication_year (#PCDATA)>: These three declarations define book, author, and publication_year as elements that contain parsed character data (PCDATA), essentially text data.

Following the DTD declarations, the XML document itself appears, with library as the root element and book, author, and publication_year as the child elements.

Remember:

  • The DOCTYPE declaration must appear before the root element, normally right after the XML declaration. Only comments and processing instructions may come in between.

  • The name in the DOCTYPE declaration must match the name of the root element for the document to be valid.

  • Each element declaration starts with <!ELEMENT and defines the content that element may have in the document.
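
To make the second point concrete, here is a minimal sketch that compares the DOCTYPE name with the root element. It relies on Python's standard xml.dom.minidom module, which exposes the DOCTYPE but does not validate against the DTD. The helper name doctype_matches_root and the mismatched sample are assumptions made for this example.

```python
from xml.dom import minidom

INTERNAL_DTD = """<!DOCTYPE library [
   <!ELEMENT library (book,author,publication_year)>
   <!ELEMENT book (#PCDATA)>
   <!ELEMENT author (#PCDATA)>
   <!ELEMENT publication_year (#PCDATA)>
]>
"""


def doctype_matches_root(xml_text: str) -> bool:
    """Check one validity rule: the DOCTYPE name equals the root element name."""
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype
    return doctype is not None and doctype.name == doc.documentElement.tagName


matching = INTERNAL_DTD + (
    "<library><book>Neuromancer</book><author>William Gibson</author>"
    "<publication_year>1984</publication_year></library>"
)
# Well-formed, but the root element is not the one named in the DOCTYPE.
mismatched = INTERNAL_DTD + "<catalog><book>Neuromancer</book></catalog>"

print(doctype_matches_root(matching))
print(doctype_matches_root(mismatched))
```

The second document still parses without error, which shows the gap between well-formed and valid: a non-validating parser accepts it even though it breaks the DTD's rules. A validating parser would check the element declarations as well, which this sketch does not attempt.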

In contrast, an external DTD is written in a separate file and can be referenced by multiple XML documents. This is useful when several XML documents share the same structure and rules. In that case, the standalone declaration in the XML declaration should be set to no (or omitted, since no is assumed when external declarations are present), indicating that the XML document relies on an external source for its declarations.

When referencing an external DTD, a SYSTEM or PUBLIC identifier is used. The SYSTEM identifier points directly to the location of the DTD file, while the PUBLIC identifier references a public resource with a formal public identifier (FPI).

Here's how the syntax for referencing an external DTD looks:

<!DOCTYPE root-element SYSTEM "path-to.dtd">
<!DOCTYPE root-element PUBLIC "FPI" "path-to.dtd">

In the first line, a SYSTEM identifier is used to specify the location of the DTD file. In the second, a PUBLIC identifier is used, followed by a URI pointing to the location of the DTD. The PUBLIC identifier can be used to identify an entry in a catalog, which is then used to fetch the DTD.
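
The two identifiers can also be read back from a parsed document. The sketch below again uses Python's standard xml.dom.minidom module, which in this usage records the identifiers without fetching or applying the DTD. The file name library.dtd, the FPI string, and the helper name external_ids are hypothetical placeholders.

```python
from xml.dom import minidom


def external_ids(xml_text: str):
    """Return the (public identifier, system identifier) pair from the DOCTYPE."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.publicId, doctype.systemId


# SYSTEM form: only a location is given.
system_doc = '<!DOCTYPE library SYSTEM "library.dtd"><library/>'

# PUBLIC form: a formal public identifier followed by a location.
public_doc = (
    '<!DOCTYPE library PUBLIC "-//Example//DTD Library 1.0//EN" "library.dtd">'
    "<library/>"
)

for text in (system_doc, public_doc):
    public_id, system_id = external_ids(text)
    print(f"public={public_id!r} system={system_id!r}")
```

Because the DTD file is never loaded here, this only shows how the reference is expressed, not whether the document conforms to it.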

XML Schema Definition (XSD)

While DTD provides a basic structure for XML documents, an XML Schema Definition (XSD) grants a higher level of control over the document's structure and the types of data it can contain. It does so by using XML syntax to define constraints and relationships between elements and attributes. It also allows for namespaces to avoid naming conflicts when combining different XML vocabularies in a single document.

For better understanding, let's create a schema to outline a book entity in an XML document:

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
   <xs:element name="book">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="title" type="xs:string" />
            <xs:element name="author" type="xs:string" />
            <xs:element name="publicationYear" type="xs:int" />
            <xs:element name="ISBN" type="xs:string" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

So, we've constructed an XSD for a book entity that should contain title, author, publicationYear, and ISBN. The order of these elements matters due to the <xs:sequence> construct.

Let's break this down:

  • <xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">: The root element of an XML Schema, with the xmlns:xs attribute binding the xs prefix to the XML Schema namespace.

  • <xs:element name="book">: An element named book is defined.

  • <xs:complexType>: Indicates the book element will contain other elements.

  • <xs:sequence>: Specifies that the contained elements must appear in the order declared.

  • <xs:element name="title" type="xs:string" />, <xs:element name="author" type="xs:string" />, <xs:element name="publicationYear" type="xs:int" />, <xs:element name="ISBN" type="xs:string" />: These lines define the child elements of book, specifying their names and data types.

In a broader sense, simple types, such as xs:string, xs:int, xs:boolean, and xs:date, restrict an element or attribute to text content of a particular form. Complex types can include other elements or attributes, enabling more intricate XML structures. Global types, which are named and defined at the top level of the schema, promote reusability across multiple elements, while attributes offer additional information attached to XML elements.
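
To see what the <xs:sequence> ordering and the xs:int type mean for an actual document, here is a deliberately small sketch that reads the book schema with Python's standard xml.etree.ElementTree module and checks an instance by hand. This is not an XSD validator: it assumes a flat schema shaped like the one above, approximates xs:int with Python's int(), and uses made-up helper names (declared_children, check_book) and a placeholder ISBN.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

BOOK_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="book">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="title" type="xs:string" />
            <xs:element name="author" type="xs:string" />
            <xs:element name="publicationYear" type="xs:int" />
            <xs:element name="ISBN" type="xs:string" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>"""


def declared_children(xsd_text: str):
    """Return (name, type) pairs from the first xs:sequence in the schema."""
    sequence = ET.fromstring(xsd_text).find(f".//{XS}sequence")
    return [(el.get("name"), el.get("type")) for el in sequence.findall(f"{XS}element")]


def check_book(xml_text: str, xsd_text: str = BOOK_XSD):
    """Return a list of problems; an empty list means these two checks passed."""
    declared = declared_children(xsd_text)
    book = ET.fromstring(xml_text)
    problems = []

    # xs:sequence: children must appear exactly in the declared order.
    expected = [name for name, _ in declared]
    actual = [child.tag for child in book]
    if actual != expected:
        problems.append(f"expected children {expected}, found {actual}")

    # xs:int (approximated): the text must parse as an integer.
    # Assumes the schema uses the literal prefix "xs", as above.
    for name, type_name in declared:
        child = book.find(name)
        if type_name == "xs:int" and child is not None:
            try:
                int(child.text or "")
            except ValueError:
                problems.append(f"<{name}> is not an integer: {child.text!r}")
    return problems


valid = (
    "<book><title>Neuromancer</title><author>William Gibson</author>"
    "<publicationYear>1984</publicationYear><ISBN>000-0000000000</ISBN></book>"
)
wrong_order = valid.replace(
    "<title>Neuromancer</title><author>William Gibson</author>",
    "<author>William Gibson</author><title>Neuromancer</title>",
)
bad_year = valid.replace("1984", "nineteen eighty-four")

for label, text in [("valid", valid), ("wrong order", wrong_order), ("bad year", bad_year)]:
    print(label, check_book(text))
```

A real schema processor handles far more (occurrence constraints, namespaces, the full type system), so in practice this job belongs to a proper XSD-aware library rather than hand-written checks.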

Final Thoughts

XML validation involves checking that your XML documents are not only well-formed but also compliant with the rules defined by your DTD or XSD. This can be done with online tools, browser extensions, or custom code that uses an XML parser.

Online tools offer convenient features like XML syntax checking, validation of various XML-related file types, and the ability to validate XML data either by loading a URL or uploading a file. Such tools offer a user-friendly way for beginners to experiment with XML validation.

For production-grade applications, integrating XML validation into your code is recommended, so that validation sits right next to your error handling and data processing logic. Implementing XML validation in code varies by programming language, but generally involves loading the XML document, loading the DTD or XSD schema, and then passing these to a validation function provided by the language's XML library.
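
The sketch below shows one possible shape for that integration in Python. The standard library parses XML but does not validate against a DTD or XSD, so the validation step here is a pluggable function: require_products is a hypothetical stand-in for whatever schema validation your chosen library provides, and load_and_validate is likewise a name invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# A validator takes the parsed root and returns a list of problems.
Validator = Callable[[ET.Element], List[str]]


def load_and_validate(xml_text: str, validator: Validator) -> ET.Element:
    """Parse and validate; return the root or raise ValueError with details."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        raise ValueError(f"not well-formed at line {line}, column {column}") from err

    problems = validator(root)
    if problems:
        raise ValueError("invalid document: " + "; ".join(problems))
    return root


def require_products(root: ET.Element) -> List[str]:
    """Hypothetical rule standing in for DTD or XSD validation."""
    return [] if root.findall("product") else ["catalog has no <product> elements"]


good = "<catalog><product><name>XML Guide</name></product></catalog>"
empty = "<catalog></catalog>"
malformed = "<catalog>\n  <product>\n</catalog>"

for text in (good, empty, malformed):
    try:
        root = load_and_validate(text, require_products)
        print("ok:", root.tag)
    except ValueError as exc:
        print("rejected:", exc)
```

Raising a single exception type for both kinds of failure keeps the calling code simple: it processes the returned root only when the document passed both the syntax check and the rules.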

Whichever validation tool or method you choose, the goal is to ensure that XML documents comply with the defined rules, which helps maintain the integrity of data exchange.

Additional Resources

XML Syntax - A Detailed Overview

XML Validity and Well-Formedness Explained