XPath - XML Path Language for Querying

XPath (XML Path Language) is a query language for selecting nodes and computing values from XML documents. It uses a path syntax similar to file system paths and provides a library of functions for testing and manipulating XML data. XPath underpins many XML technologies, including XSLT, XQuery, and XML Schema.

XPath Basics

Node Types

XPath recognizes several types of nodes:

Element nodes: <book>, <title>
Attribute nodes: @id, @class
Text nodes: Text content within elements
Comment nodes: <!-- a comment -->
Processing instruction nodes: <?xml-stylesheet?>
Namespace nodes: Namespace declarations
Document node: The root of the document tree

Sample XML Document

We'll use this XML throughout our examples:

<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book id="1" genre="fiction">
        <title>To Kill a Mockingbird</title>
        <author>Harper Lee</author>
        <price currency="USD">12.99</price>
        <publishDate>1960-07-11</publishDate>
    </book>
    <book id="2" genre="science">
        <title>A Brief History of Time</title>
        <author>Stephen Hawking</author>
        <price currency="USD">15.99</price>
        <publishDate>1988-04-01</publishDate>
    </book>
    <book id="3" genre="fiction">
        <title>1984</title>
        <author>George Orwell</author>
        <price currency="USD">13.99</price>
        <publishDate>1949-06-08</publishDate>
    </book>
</library>

Basic Path Expressions

Absolute Paths

Start from the document root with /:

/library                    <!-- Root library element -->
/library/book               <!-- All book elements -->
/library/book/title         <!-- All title elements -->
/library/book[1]/title      <!-- Title of first book -->

Relative Paths

Start from the current context node:

book                        <!-- Child book elements -->
book/title                  <!-- Title elements of child books -->
.                           <!-- Current node -->
..                          <!-- Parent node -->

Descendant Operator

Use // to select descendants at any level:

//book                      <!-- All book elements anywhere -->
//title                     <!-- All title elements anywhere -->
/library//price             <!-- All price elements under library -->
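
The path forms above can be tried with a minimal sketch in Python's standard-library xml.etree.ElementTree, used here instead of a full XPath engine. ElementTree implements only a subset of XPath and does not accept absolute paths on an element, so the paths are written relative to the library root. The abbreviated sample document and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (titles only).
XML = """
<library>
    <book id="1" genre="fiction"><title>To Kill a Mockingbird</title></book>
    <book id="2" genre="science"><title>A Brief History of Time</title></book>
    <book id="3" genre="fiction"><title>1984</title></book>
</library>
"""

root = ET.fromstring(XML)  # root is the <library> element

# Relative child path: selects the same nodes as /library/book/title
# when the context node is <library>.
child_titles = [t.text for t in root.findall("book/title")]

# Descendant search: selects the same nodes as /library//title.
descendant_titles = [t.text for t in root.findall(".//title")]

# '..' steps from each matched <title> back up to its parent.
parents = [p.tag for p in root.findall("book/title/..")]

print(child_titles)
print(descendant_titles)
print(parents)
```

In this flat document the child path and the descendant search return the same three titles; they would differ only if title elements appeared at other depths.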

Node Selection

Selecting by Position

/library/book[1]            <!-- First book -->
/library/book[last()]       <!-- Last book -->
/library/book[last()-1]     <!-- Second to last book -->
/library/book[position() < 3]  <!-- First two books -->
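
As a rough illustration of positional selection, the sketch below runs the first three expressions through Python's standard-library xml.etree.ElementTree. That module supports integer positions, last(), and last()-N, but not general expressions such as position() < 3. The helper name title_of and the abbreviated document are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (titles only).
XML = """
<library>
    <book id="1"><title>To Kill a Mockingbird</title></book>
    <book id="2"><title>A Brief History of Time</title></book>
    <book id="3"><title>1984</title></book>
</library>
"""

root = ET.fromstring(XML)


def title_of(path):
    """Return the title text of the first book matched by `path`."""
    return root.find(path).findtext("title")


first = title_of("book[1]")                # positions start at 1, not 0
last = title_of("book[last()]")
second_to_last = title_of("book[last()-1]")

print(first, last, second_to_last, sep=" | ")
```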

Attribute Selection

/library/book/@id           <!-- All id attributes -->
/library/book/@*            <!-- All attributes of book elements -->
//price/@currency           <!-- All currency attributes -->

Text Content

/library/book/title/text()  <!-- Text content of titles -->
//author/text()             <!-- Text content of all authors -->

Predicates (Filters)

Predicates filter nodes using conditions in square brackets:

Basic Predicates

/library/book[@genre]               <!-- Books with genre attribute -->
/library/book[@genre='fiction']     <!-- Books with genre='fiction' -->
/library/book[price > 14]           <!-- Books with price > 14 -->
/library/book[position() = 2]       <!-- Second book -->

Multiple Conditions

/library/book[@genre='fiction' and price < 14]     <!-- Fiction books under $14 -->
/library/book[@genre='fiction' or @genre='science'] <!-- Fiction or science books -->
/library/book[not(@genre='fiction')]               <!-- Books that are not fiction, including any without a genre attribute -->

Text Content Predicates

/library/book[title='1984']                     <!-- Book with title "1984" -->
/library/book[contains(title, 'Time')]          <!-- Books with "Time" in title -->
/library/book[starts-with(author, 'George')]    <!-- Books by authors starting with "George" -->
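
The following sketch exercises a few of these predicates with Python's standard-library xml.etree.ElementTree. It is a simplified illustration, not a full XPath evaluation: ElementTree handles attribute tests, child-text equality, and chained predicates, but not and/or, contains(), or numeric comparisons, so the price filter is done in Python instead. The ids helper and the abbreviated document are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (titles and prices only).
XML = """
<library>
    <book id="1" genre="fiction"><title>To Kill a Mockingbird</title><price>12.99</price></book>
    <book id="2" genre="science"><title>A Brief History of Time</title><price>15.99</price></book>
    <book id="3" genre="fiction"><title>1984</title><price>13.99</price></book>
</library>
"""

root = ET.fromstring(XML)


def ids(books):
    """Collect the id attribute of each matched book."""
    return [b.get("id") for b in books]


has_genre = ids(root.findall("book[@genre]"))
fiction = ids(root.findall("book[@genre='fiction']"))
by_title = ids(root.findall("book[title='1984']"))

# Two chained predicates act like a single 'and' condition here.
fiction_1984 = ids(root.findall("book[@genre='fiction'][title='1984']"))

# ElementTree cannot evaluate book[price > 14]; filter in Python instead.
over_14 = ids(b for b in root.findall("book") if float(b.findtext("price")) > 14)

print(has_genre, fiction, by_title, fiction_1984, over_14)
```

Chained predicates and a single and-condition select the same nodes only when neither predicate is positional, because each predicate is applied to the result of the previous one.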

Axes

Axes define the relationship between nodes:

Child Axis (default)

child::book                 <!-- Same as book -->
/library/child::book        <!-- All book children of library -->

Descendant Axis

descendant::title           <!-- Same as .//title -->
/library/descendant::price  <!-- All price descendants of library -->

Parent Axis

parent::*                   <!-- Parent element -->
..                          <!-- Short for parent::node() -->

Ancestor Axis

ancestor::*                 <!-- All ancestors -->
ancestor::library           <!-- library ancestor -->
ancestor-or-self::book      <!-- book ancestors including self -->

Sibling Axes

following-sibling::book     <!-- Following sibling books -->
preceding-sibling::book     <!-- Preceding sibling books -->

Attribute Axis

attribute::id               <!-- Same as @id -->
attribute::*                <!-- All attributes -->

XPath Functions

String Functions

<!-- Text manipulation -->
string(/library/book[1]/title)              <!-- Convert to string -->
concat(author, ' - ', title)                <!-- Concatenate strings -->
substring(title, 1, 5)                      <!-- First 5 characters -->
string-length(title)                        <!-- Length of title -->
normalize-space(title)                      <!-- Remove extra whitespace -->

<!-- String testing -->
contains(title, 'Time')                     <!-- Contains substring -->
starts-with(author, 'Stephen')              <!-- Starts with string -->
ends-with(title, 'Time')                    <!-- Ends with string (XPath 2.0+) -->

Numeric Functions

count(/library/book)                        <!-- Count book elements -->
sum(/library/book/price)                    <!-- Sum of all prices -->
avg(/library/book/price)                    <!-- Average price (XPath 2.0+) -->
max(/library/book/price)                    <!-- Maximum price (XPath 2.0+) -->
min(/library/book/price)                    <!-- Minimum price (XPath 2.0+) -->
round(price)                                <!-- Round to nearest integer -->
ceiling(price)                              <!-- Round up -->
floor(price)                                <!-- Round down -->

Boolean Functions

boolean(price)                              <!-- True if price exists -->
not(@genre='fiction')                       <!-- Logical NOT -->
true()                                      <!-- Always true -->
false()                                     <!-- Always false -->

Node Functions

position()                                  <!-- Position of current node -->
last()                                      <!-- Number of nodes in the current context -->
name()                                      <!-- Name of current node -->
local-name()                                <!-- Local name (without namespace) -->
namespace-uri()                             <!-- Namespace URI -->

Advanced Expressions

Union Operator

Select multiple node sets:

/library/book/title | /library/book/author     <!-- All titles and authors -->
//title | //author                             <!-- All titles and authors anywhere -->

Complex Predicates

<!-- Books published after 1950 -->
/library/book[number(substring(publishDate, 1, 4)) > 1950]

<!-- Books with titles longer than 10 characters -->
/library/book[string-length(title) > 10]

<!-- Books where price is above average (avg() requires XPath 2.0+) -->
/library/book[price > avg(//price)]

Conditional Logic

<!-- XPath 2.0+ conditional expression -->
if (price > 15) then 'expensive' else 'affordable'

<!-- Using predicates for conditions (a string literal as the final step also requires XPath 2.0+) -->
/library/book[price > 15]/'expensive'
/library/book[price <= 15]/'affordable'

Namespaces in XPath

When working with namespaced XML:

<lib:library xmlns:lib="http://example.com/library">
    <lib:book lib:id="1">
        <lib:title>Sample Book</lib:title>
    </lib:book>
</lib:library>

<!-- After registering lib prefix -->
/lib:library/lib:book
/lib:library/lib:book/@lib:id
//lib:title
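
What "registering" a prefix means depends on the host API. As one possible sketch, Python's standard-library xml.etree.ElementTree takes a prefix-to-URI dictionary on each query. The prefix x below is deliberately different from the document's lib prefix, to show that only the namespace URI has to match; the variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

XML = """
<lib:library xmlns:lib="http://example.com/library">
    <lib:book lib:id="1">
        <lib:title>Sample Book</lib:title>
    </lib:book>
</lib:library>
"""

LIB_URI = "http://example.com/library"

# The prefix used in queries is local to this mapping; it does not have
# to be the prefix used in the document.
NS = {"x": LIB_URI}

root = ET.fromstring(XML)

books = root.findall("x:book", NS)
titles = [t.text for t in root.findall(".//x:title", NS)]

# Without a mapping, the unprefixed name matches nothing, because the
# elements live in the library namespace.
unprefixed = root.findall("book")

# ElementTree stores namespaced attributes under '{uri}local' keys.
book_id = books[0].get("{%s}id" % LIB_URI)

print(len(books), titles, len(unprefixed), book_id)
```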

XPath in Different Contexts

JavaScript (DOM)

// Using XPath in the browser
const xpath = '//book[@genre="fiction"]/title';
const result = document.evaluate(
    xpath, 
    document, 
    null, 
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, 
    null
);

for (let i = 0; i < result.snapshotLength; i++) {
    const node = result.snapshotItem(i);
    console.log(node.textContent);
}

Python (lxml)

from lxml import etree

# Parse XML
doc = etree.parse('library.xml')

# XPath queries
titles = doc.xpath('//book[@genre="fiction"]/title/text()')
prices = doc.xpath('//price[. > 14]/text()')
books = doc.xpath('//book[contains(title, "Time")]')

# With namespaces
namespaces = {'lib': 'http://example.com/library'}
books = doc.xpath('//lib:book', namespaces=namespaces)

Java (javax.xml.xpath)

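import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;

// The JDK's built-in implementation supports XPath 1.0; document is an already parsed org.w3c.dom.Document
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xpath = xPathFactory.newXPath();

// Simple query
String expression = "//book[@genre='fiction']/title";
NodeList nodes = (NodeList) xpath.evaluate(
    expression, document, XPathConstants.NODESET
);

// With custom namespace context: map only the "lib" prefix to its URI
xpath.setNamespaceContext(new NamespaceContext() {
    public String getNamespaceURI(String prefix) {
        return "lib".equals(prefix)
            ? "http://example.com/library"
            : XMLConstants.NULL_NS_URI;
    }
    // Reverse lookups are not needed for evaluating expressions
    public String getPrefix(String namespaceURI) { return null; }
    public Iterator<String> getPrefixes(String namespaceURI) { return null; }
});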

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <!-- XPath in select attributes -->
        <xsl:for-each select="//book[@genre='fiction']">
            <xsl:value-of select="title"/>
        </xsl:for-each>
        
        <!-- XPath in test attributes -->
        <xsl:if test="count(//book) > 2">
            <p>Many books available</p>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Performance Considerations

Efficient XPath Expressions

<!-- Efficient: Specific path -->
/library/book[@id='1']

<!-- Less efficient: Descendant search -->
//book[@id='1']

<!-- Concise: Abbreviated positional predicate -->
/library/book[1]

<!-- Equivalent: [1] is shorthand for [position()=1] -->
/library/book[position()=1]

Indexing Strategies

Use specific paths when possible
Avoid deep descendant searches (//) when unnecessary
Use predicates early to filter results
Consider the document structure when designing XPath

XPath 2.0 and 3.0 Enhancements

Sequence Operations (XPath 2.0+)

(1 to 5)                            <!-- Sequence: 1, 2, 3, 4, 5 -->
//book[position() = (1 to 3)]       <!-- First three books -->
distinct-values(//author)           <!-- Unique author names -->

Enhanced Functions (XPath 2.0+)

matches(title, '[0-9]+')            <!-- Regular expression matching -->
replace(title, 'Time', 'Era')       <!-- String replacement -->
tokenize(author, ' ')               <!-- Split string into sequence -->
format-number(price, '#.00')        <!-- Number formatting (XSLT function; part of XPath from 3.0) -->

Date/Time Functions (XPath 2.0+)

current-date()                      <!-- Current date -->
year-from-date(publishDate)         <!-- Extract year from date -->
format-date(publishDate, '[Y0001]-[M01]-[D01]')  <!-- Format date (XSLT 2.0 function; part of XPath from 3.0) -->

Common Patterns and Examples

Finding Specific Content

<!-- Books by specific author -->
//book[author='Stephen Hawking']

<!-- Books in price range -->
//book[price >= 12 and price <= 15]

<!-- Books published in specific decade -->
//book[substring(publishDate, 1, 3) = '196']

Aggregation and Calculation

<!-- Total number of books -->
count(//book)

<!-- Average price -->
sum(//price) div count(//price)

<!-- Most expensive book (max() requires XPath 2.0+; an XPath 1.0 alternative is //book[not(//price > price)]) -->
//book[price = max(//price)]

<!-- Books above average price -->
//book[price > (sum(//price) div count(//price))]
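
When the available engine is limited to XPath 1.0 or less, one workaround is to select the nodes with a simple path and aggregate in the host language. The sketch below does this with Python's standard-library xml.etree.ElementTree, which has no XPath functions at all. The helper name price_of and the abbreviated document are assumptions, and this is a host-side substitute rather than an XPath evaluation.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (titles and prices only).
XML = """
<library>
    <book id="1"><title>To Kill a Mockingbird</title><price>12.99</price></book>
    <book id="2"><title>A Brief History of Time</title><price>15.99</price></book>
    <book id="3"><title>1984</title><price>13.99</price></book>
</library>
"""

root = ET.fromstring(XML)
books = root.findall("book")


def price_of(book):
    """Read a book's price child as a float."""
    return float(book.findtext("price"))


count = len(books)                                   # count(//book)
average = sum(price_of(b) for b in books) / count    # sum(//price) div count(//price)
most_expensive = max(books, key=price_of).findtext("title")
above_average = [b.findtext("title") for b in books if price_of(b) > average]

print(count, round(average, 2), most_expensive, above_average)
```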

Text Processing

<!-- Books with single-word titles -->
//book[not(contains(title, ' '))]

<!-- Authors with last name starting with specific letter -->
//book[starts-with(substring-after(author, ' '), 'H')]

<!-- Normalize and compare text -->
//book[normalize-space(title) = '1984']

Debugging XPath Expressions

Testing Strategies

Start simple: Begin with basic paths and add complexity
Use browser tools: Most browser developer consoles can evaluate XPath, for example with the $x() console helper or document.evaluate()
Test incrementally: Build expressions step by step
Validate context: Ensure you understand the current node context
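
The incremental approach can be sketched as a small loop that evaluates each intermediate expression and reports how many nodes it matches, so the step that unexpectedly drops to zero is easy to spot. This example assumes Python's standard-library xml.etree.ElementTree and an abbreviated copy of the sample document; the steps list is illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (titles only).
XML = """
<library>
    <book id="1" genre="fiction"><title>To Kill a Mockingbird</title></book>
    <book id="2" genre="science"><title>A Brief History of Time</title></book>
    <book id="3" genre="fiction"><title>1984</title></book>
</library>
"""

root = ET.fromstring(XML)

# Build the final expression one step at a time.
steps = [
    "book",
    "book[@genre='fiction']",
    "book[@genre='fiction']/title",
]

# Map each intermediate expression to the number of nodes it selects.
counts = {path: len(root.findall(path)) for path in steps}

for path, n in counts.items():
    print(f"{n:>2}  {path}")
```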

Common Mistakes

<!-- Wrong when an absolute path is intended: no leading slash -->
library/book                <!-- Relative path, resolved against the context node -->

<!-- Correct: Absolute path -->
/library/book

<!-- Risky: Unquoted value compares @id as a number -->
/library/book[@id=1]         <!-- Also matches id="01" or id="1.0" -->

<!-- Safer: Quoted value compares @id as a string -->
/library/book[@id='1']

Best Practices

Expression Design

Be specific: Use precise paths when possible
Use meaningful predicates: Filter early and effectively
Consider performance: Avoid unnecessary descendant axes
Document complex expressions: Add comments for maintainability

Maintainability

<!-- Good: Explicit path with chained predicates -->
/library/book[@genre='fiction'][price < 15]

<!-- Better: Single combined predicate that selects exactly the nodes needed -->
/library/book[@genre='fiction' and price < 15]/title

Conclusion

XPath is an essential skill for working with XML documents. Its powerful selection and filtering capabilities make it indispensable for XML processing, transformation, and querying. Mastering XPath syntax and functions will significantly enhance your ability to work with XML data effectively.

Next Steps

Learn XSLT to transform XML using XPath
Explore XQuery for more complex querying
Study XML Processing to use XPath in applications