XML Syntax - A Detailed Overview

The Role and Syntax of XML

XML, like HTML, is derived from SGML, an international standard for defining markup languages. HTML has been the principal language for presenting content on the web. XML takes a different approach: it lets you define your own set of markup elements.

Unlike HTML, which is primarily concerned with how data looks (presentation), XML focuses on what data is (content). The two therefore often complement each other: XML handles data description and transport, while HTML handles data presentation.

XML itself isn't a programming language in the conventional sense. It doesn't run or execute. Rather, XML files serve as structured data repositories ready for interaction with applications that display, process, or modify the data.

It's the syntax of XML that allows it to accomplish these tasks, which is what we'll focus on next. Let's start with the structure of an XML document.

XML Structure: Roots, Elements, and Hierarchies

XML documents have a "tree" structure, composed of elements arranged in a hierarchy. Every XML document has exactly one root element, which encloses all other elements.

To elaborate, let's consider a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<book>
    <title>
        Learning XML
    </title>
</book>

The first line, <?xml version="1.0" encoding="UTF-8"?>, is the XML declaration, which is part of the document's prolog. It's optional but highly recommended, as it states the XML version and the character encoding used in the document. When present, it must be the very first thing in the file, with nothing before it, not even white space.

<book> is the root element of this document, and it contains the <title> element, its child. The text "Learning XML" is encapsulated within the <title> element.

An XML structure can support deeper nesting, as illustrated below:

<root>
  <child>
    Example Data
    <subchild>More Example Data</subchild>
  </child>
  <child>Example Data</child>
</root>

Here, <root> is the root element, with two <child> elements as its direct children. The first <child> contains both text and a nested <subchild> element, while the second contains only text.
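
As a small sketch of the single-root rule, the document below wraps two sibling records in one enclosing element. The names <notes> and <note> are hypothetical, chosen only for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- <notes> is the single root; both <note> elements are its children. -->
<notes>
  <note>Buy milk</note>
  <note>Call the library</note>
</notes>
<!-- Removing the <notes> wrapper would leave two top-level elements,
     which a parser rejects as not well-formed. -->
```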

Now that we have a general understanding of XML's structure, we'll go over the syntax specifics, including elements, attributes, and the rules that shape them.

XML Elements: A Closer Look

Elements are the core building blocks of XML. An element consists of an opening tag, a matching closing tag, and the content placed in between.

For instance, <greeting>Hello, World!</greeting> is a valid XML element. Tags in XML are case-sensitive, so <greeting> and <Greeting> are distinct elements, and an opening tag must be closed by a tag with exactly the same capitalization.

XML elements can also be empty, with no content between the opening and closing tags. In that case the two tags can be combined into a single self-closing tag, like so: <element/>. Empty elements can still carry attributes, which we'll explain later on.
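
The points above can be sketched in one short document. The element names here (<messages>, <separator>) are assumptions made up for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<messages>
  <!-- Opening and closing tags match exactly, including case. -->
  <greeting>Hello, World!</greeting>
  <!-- A different element as far as XML is concerned. -->
  <Greeting>Hello again!</Greeting>
  <!-- Two equivalent ways to write an empty element. -->
  <separator></separator>
  <separator/>
  <!-- Mixing cases in one element, such as <greeting>...</Greeting>,
       is an error because the tags no longer match. -->
</messages>
```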

Naming Rules and Best Practices for XML Elements

When naming XML elements, certain rules must be followed:

  • The element names are case-sensitive.

  • They must begin with a letter or an underscore. Names beginning with "xml" in any capitalization ("Xml", "XML", and so on) are reserved and should not be used.

  • Names can contain letters, digits, hyphens, underscores, and periods, but cannot contain spaces.

These represent the "hard" rules, but some best practices should also be considered for better readability and to avoid issues with certain software systems:

  • Names should be descriptive and concise.

  • Hyphens ("-"), periods ("."), and colons (":") in names are best avoided. Some software may read a hyphen as a minus sign or a period as a property accessor, and colons are reserved for namespaces.

  • Non-English letters are allowed, but they might produce issues related to software compatibility.
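
To make the naming rules and recommendations concrete, here is a hedged sketch. All element names are hypothetical, and the invalid names appear only inside a comment so that the document itself stays well-formed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <!-- Valid: each name starts with a letter or an underscore. -->
  <bookTitle>Learning XML</bookTitle>
  <_internalNote>Reorder in spring</_internalNote>
  <chapter2>Elements</chapter2>
  <!-- Valid but discouraged: hyphens and periods in names. -->
  <first-name>Jane</first-name>
  <last.name>Doe</last.name>
  <!-- Not valid (shown only inside this comment):
       <2ndChapter>  starts with a digit
       <book title>  contains a space
       Reserved, so best avoided:
       <xmlData>     starts with the letters "xml" -->
</catalog>
```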

Nesting Elements

The nested structure of XML elements, which we touched on earlier, is a key feature of XML. It means that any element opened within another must also be closed within that same element.

Extending our earlier book example into a small library, we can see this principle in action:

<library>
  <book>
    <title>Learning XML</title>
    <author>Jane Doe</author>
  </book>
  <book>
    <title>Mastering XML</title>
    <author>John Doe</author>
  </book>
</library>

Here, <library> is the root element and contains two <book> elements. Each <book> in turn holds a <title> and an <author>. Crucially, each <title> and <author> is opened and closed inside its own <book>, which is what keeps the document properly nested.
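
For contrast, the sketch below repeats one correctly nested book and describes, inside a comment only, what an overlapping (incorrectly nested) version would look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <!-- <title> opens and closes inside <book>: properly nested. -->
    <title>Learning XML</title>
  </book>
  <!-- Overlapping tags like the following are not well-formed,
       because <title> would close after its parent <book>:
       <book><title>Learning XML</book></title> -->
</library>
```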

Because XML is extensible, you can define your own tags, tailored to your data's structure and semantics. In addition, XML elements can hold attributes, which provide extra information about the element.

XML Attributes

Attributes provide additional information about XML elements. They are always defined within the start tag of an element, in the format name="value". If you're familiar with HTML you'll likely notice the similarity.

Let's enrich our previous example with some attributes:

<library name="City Library">
  <book id="1">
    <title>Learning XML</title>
    <author>Jane Doe</author>
  </book>
  <book id="2">
    <title>Mastering XML</title>
    <author>John Doe</author>
  </book>
</library>

Now, our <library> element has a name attribute with the value "City Library", and each <book> element has an id attribute. Note that an attribute value must always be enclosed in quotes, either single (') or double (").
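
Having two quote styles is useful when the value itself contains a quote character. A minimal sketch, assuming a hypothetical nickname attribute that is not part of the example above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<library name="City Library">
  <!-- Double quotes around the value. -->
  <book id="1" nickname="The XML Primer"/>
  <!-- Single quotes let the value contain double quotes. -->
  <book id="2" nickname='The "Big" Book'/>
  <!-- Double quotes let the value contain an apostrophe. -->
  <book id="3" nickname="Reader's Choice"/>
</library>
```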

Attributes are designed to hold data related to a specific element. However, while an element can contain multiple values and even whole tree structures, an attribute holds a single text value, and a given attribute name can appear only once per element. This makes attributes harder to extend later and less suitable for complex data structures.

This principle should be considered when deciding whether to use an attribute or an element to represent a particular piece of data. While XML doesn't enforce strict rules about when to use elements versus attributes, a common practice is to use attributes for metadata – data about data – and use elements for the core data itself.
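
One way to picture the trade-off is to model the same kind of information both ways. This is only a sketch: the author attribute and the <name> and <email> elements are hypothetical additions, and the address is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <!-- Author as an attribute: compact, but limited to one plain string. -->
  <book id="1" author="Jane Doe">
    <title>Learning XML</title>
  </book>
  <!-- Authors as elements: they can repeat and can grow nested structure. -->
  <book id="2">
    <title>Mastering XML</title>
    <author>
      <name>John Doe</name>
      <email>john@example.com</email>
    </author>
    <author>
      <name>Jane Doe</name>
    </author>
  </book>
</library>
```

In both books the id stays an attribute, in line with the common practice of keeping metadata in attributes and core data in elements.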

Handling Special Characters with Entity References

XML uses entity references to represent characters that would otherwise confuse a parser, such as < and &. (Entities can also stand in for large blocks of repeated text.) For instance, the less than sign (<) denotes the start of a tag and will be misinterpreted if it appears as part of your data. To include such characters, you use their entity references, which begin with & and end with ;. Thus < is written as &lt;.
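
A minimal sketch of entity references in use, with hypothetical element and attribute names (<quotes>, <formula>, <quote>, text):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<quotes>
  <!-- &amp; and &lt; are required here; a raw & or < would break parsing. -->
  <formula>x &lt; y &amp;&amp; y &gt; z</formula>
  <!-- &quot; is needed inside a double-quoted attribute value. -->
  <quote text="She said &quot;hello&quot;"/>
  <!-- &apos; is needed inside a single-quoted attribute value. -->
  <quote text='It&apos;s fine'/>
</quotes>
```

After parsing, an application sees the literal characters, so the content of <formula> reads x < y && y > z.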

Below, you can find the five predefined entity references in XML:

Disallowed Character    Entity Reference    Character Description
<                       &lt;                Less than sign
>                       &gt;                Greater than sign
&                       &amp;               Ampersand
'                       &apos;              Apostrophe
"                       &quot;              Quotation mark

White Space and Comments

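Unlike HTML, which collapses runs of white space into a single space, XML preserves white space inside element content, so multiple consecutive spaces are retained. Comments in XML are written between <!-- and -->, just as in HTML, but a double hyphen (two dashes in a row) may not appear inside the comment text.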
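
Both points can be sketched in a few lines. The <notes> and <note> names are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<notes>
  <!-- The three spaces between the words are kept by the parser. -->
  <note>Hello   World</note>
  <!-- A comment may span
       several lines. -->
  <note>Second note</note>
  <!-- Not allowed: a double hyphen inside a comment body. -->
</notes>
```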

Combining all the aspects we've discussed, a more detailed example of an XML document could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<library name="City Library">
  <book id="1">
    <title>Learning XML</title>
    <author>Jane Doe</author>
    <price>Price is &lt; $30</price>
  </book>
  <book id="2">
    <title>Mastering XML</title>
    <author>John Doe</author>
    <price>Price is &gt; $30</price>
  </book>
</library>

The above XML document is "well-formed": it follows the rules of XML syntax and structure, which is what allows an XML parser to read and interpret it successfully.
