What is XPath
XPath (XML Path Language) is a powerful query language for selecting nodes or values from XML documents. It uses a path-like syntax similar to file system paths and provides a rich set of functions for manipulating and testing XML data. XPath is fundamental to many XML technologies including XSLT, XQuery, and XML Schema.
XPath Basics
Node Types
XPath recognizes several types of nodes:
- Element nodes: <book>, <title>
- Attribute nodes: @id, @class
- Text nodes: Text content within elements
- Comment nodes: <!-- comment -->
- Processing instruction nodes: <?xml-stylesheet?>
- Namespace nodes: Namespace declarations
- Document node: The root of the document tree
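To make the node types concrete, here is a minimal sketch that walks a small document with Python's standard-library xml.dom.minidom and labels each node. The document string and the walk helper are made up for illustration. Note that the DOM exposes namespace declarations as ordinary attributes, so namespace nodes do not appear as a separate kind here.

```python
from xml.dom import minidom, Node

# Small hypothetical document containing one node of each common type.
doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="style.css"?>'
    '<book id="1"><!-- comment --><title>1984</title></book>'
)

KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}

def walk(node, depth=0, out=None):
    """Collect (depth, kind, label) for every node in document order."""
    out = [] if out is None else out
    label = node.nodeValue if node.nodeValue is not None else node.nodeName
    out.append((depth, KINDS[node.nodeType], label))
    for child in node.childNodes:
        walk(child, depth + 1, out)
    return out

nodes = walk(doc)
for depth, kind, label in nodes:
    print("  " * depth + f"{kind}: {label}")

# Attributes hang off their element; they are not among its children.
book = doc.documentElement
print("attribute id =", book.getAttribute("id"))
```

The document node sits above the root element, which is why an absolute XPath starts at / and not at the first element.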
Sample XML Document
We'll use this XML throughout our examples:
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="1" genre="fiction">
    <title>To Kill a Mockingbird</title>
    <author>Harper Lee</author>
    <price currency="USD">12.99</price>
    <publishDate>1960-07-11</publishDate>
  </book>
  <book id="2" genre="science">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <price currency="USD">15.99</price>
    <publishDate>1988-04-01</publishDate>
  </book>
  <book id="3" genre="fiction">
    <title>1984</title>
    <author>George Orwell</author>
    <price currency="USD">13.99</price>
    <publishDate>1949-06-08</publishDate>
  </book>
</library>
Basic Path Expressions
Absolute Paths
Start from the document root with /:
/library <!-- Root library element -->
/library/book <!-- All book elements -->
/library/book/title <!-- All title elements -->
/library/book[1]/title <!-- Title of first book -->
Relative Paths
Start from the current context node:
book <!-- Child book elements -->
book/title <!-- Title elements of child books -->
. <!-- Current node -->
.. <!-- Parent node -->
Descendant Operator
Use // to select descendants at any level:
//book <!-- All book elements anywhere -->
//title <!-- All title elements anywhere -->
/library//price <!-- All price elements under library -->
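The path expressions above can be tried out with Python's standard-library xml.etree.ElementTree. This is a rough sketch under two assumptions: ElementTree implements only a subset of XPath and evaluates paths relative to an element, so /library/book is written as book relative to the root <library> element. The XML string is a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (only the fields used here).
XML = """
<library>
  <book id="1"><title>To Kill a Mockingbird</title><price>12.99</price></book>
  <book id="2"><title>A Brief History of Time</title><price>15.99</price></book>
  <book id="3"><title>1984</title><price>13.99</price></book>
</library>
"""
root = ET.fromstring(XML)  # root is the <library> element

# /library/book          -> children of the root named "book"
books = root.findall("book")
# /library/book/title    -> title children of those books
titles = [t.text for t in root.findall("book/title")]
# /library/book[1]/title -> XPath positions are 1-based
first_title = root.find("book[1]/title").text
# /library//price        -> price elements at any depth below the root
prices = [p.text for p in root.findall(".//price")]

print(len(books), titles, first_title, prices)
```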
Node Selection
Selecting by Position
/library/book[1] <!-- First book -->
/library/book[last()] <!-- Last book -->
/library/book[last()-1] <!-- Second to last book -->
/library/book[position() < 3] <!-- First two books -->
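As a sketch of how positional predicates behave, the snippet below runs the first three expressions through Python's standard-library ElementTree, which supports integer positions, last(), and last()-N. It has no position() comparisons, so position() < 3 is emulated with a list slice. The XML is a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book id="1"><title>To Kill a Mockingbird</title></book>
  <book id="2"><title>A Brief History of Time</title></book>
  <book id="3"><title>1984</title></book>
</library>
"""
root = ET.fromstring(XML)

first = root.find("book[1]")               # /library/book[1]
last = root.find("book[last()]")           # /library/book[last()]
second_last = root.find("book[last()-1]")  # /library/book[last()-1]

# position() < 3 keeps XPath positions 1 and 2, i.e. list indexes 0 and 1.
first_two = root.findall("book")[:2]

print(first.get("id"), last.get("id"), second_last.get("id"))
print([b.get("id") for b in first_two])
```

The off-by-one between XPath's 1-based positions and Python's 0-based indexes is a common source of mistakes when moving between the two.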
Attribute Selection
/library/book/@id <!-- All id attributes -->
/library/book/@* <!-- All attributes of book elements -->
//price/@currency <!-- All currency attributes -->
Text Content
/library/book/title/text() <!-- Text content of titles -->
//author/text() <!-- Text content of all authors -->
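ElementTree's XPath subset cannot select attributes or text() directly, so a minimal sketch of the expressions above selects the elements first and then reads the attribute or text from each one. The XML is a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book id="1" genre="fiction"><author>Harper Lee</author><price currency="USD">12.99</price></book>
  <book id="2" genre="science"><author>Stephen Hawking</author><price currency="USD">15.99</price></book>
  <book id="3" genre="fiction"><author>George Orwell</author><price currency="USD">13.99</price></book>
</library>
"""
root = ET.fromstring(XML)

# /library/book/@id  -> read one attribute from each selected element
ids = [b.get("id") for b in root.findall("book")]
# /library/book/@*   -> the full attribute mapping of each book
all_attrs = [dict(b.attrib) for b in root.findall("book")]
# //price/@currency
currencies = [p.get("currency") for p in root.iter("price")]
# //author/text()
authors = [a.text for a in root.iter("author")]

print(ids, currencies, authors)
```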
Predicates (Filters)
Predicates filter nodes using conditions in square brackets:
Basic Predicates
/library/book[@genre] <!-- Books with genre attribute -->
/library/book[@genre='fiction'] <!-- Books with genre='fiction' -->
/library/book[price > 14] <!-- Books with price > 14 -->
/library/book[position() = 2] <!-- Second book -->
Multiple Conditions
/library/book[@genre='fiction' and price < 14] <!-- Fiction books under $14 -->
/library/book[@genre='fiction' or @genre='science'] <!-- Fiction or science books -->
/library/book[not(@genre='fiction')] <!-- Books whose genre is not 'fiction', including books with no genre -->
Text Content Predicates
/library/book[title='1984'] <!-- Book with title "1984" -->
/library/book[contains(title, 'Time')] <!-- Books with "Time" in title -->
/library/book[starts-with(author, 'George')] <!-- Books by authors starting with "George" -->
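The following sketch shows what several of these predicates select from the sample document. It uses Python's standard-library ElementTree, which handles [@attr], [@attr='value'], and [tag='text'] natively. Numeric comparisons, and/or, contains(), and starts-with() are outside its XPath subset, so they are emulated in plain Python here. The ids helper is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book id="1" genre="fiction"><title>To Kill a Mockingbird</title><author>Harper Lee</author><price>12.99</price></book>
  <book id="2" genre="science"><title>A Brief History of Time</title><author>Stephen Hawking</author><price>15.99</price></book>
  <book id="3" genre="fiction"><title>1984</title><author>George Orwell</author><price>13.99</price></book>
</library>
"""
root = ET.fromstring(XML)

def ids(books):
    """Hypothetical helper: list the id attribute of each book."""
    return [b.get("id") for b in books]

fiction = root.findall("book[@genre='fiction']")  # [@genre='fiction']
titled_1984 = root.findall("book[title='1984']")  # [title='1984']

# [@genre='fiction' and price < 14]: the comparison is done in Python
cheap_fiction = [b for b in fiction if float(b.findtext("price")) < 14]
# [contains(title, 'Time')]
time_books = [b for b in root.findall("book") if "Time" in b.findtext("title")]
# [starts-with(author, 'George')]
by_george = [b for b in root.findall("book")
             if b.findtext("author").startswith("George")]

print(ids(fiction), ids(titled_1984), ids(cheap_fiction),
      ids(time_books), ids(by_george))
```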
Axes
Axes define the relationship between nodes:
Child Axis (default)
child::book <!-- Same as book -->
/library/child::book <!-- All book children of library -->
Descendant Axis
descendant::title <!-- Like .//title: title descendants of the context node -->
/library/descendant::price <!-- All price descendants of library -->
Parent Axis
parent::* <!-- Parent element -->
.. <!-- Abbreviation for parent::node() -->
Ancestor Axis
ancestor::* <!-- All ancestors -->
ancestor::library <!-- library ancestor -->
ancestor-or-self::book <!-- book ancestors including self -->
Sibling Axes
following-sibling::book <!-- Following sibling books -->
preceding-sibling::book <!-- Preceding sibling books -->
Attribute Axis
attribute::id <!-- Same as @id -->
attribute::* <!-- All attributes -->
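To illustrate what the parent, ancestor, and sibling axes return, here is a small sketch in Python. ElementTree elements do not keep a reference to their parent, so the example builds a child-to-parent map first; the helper functions are assumptions made for this illustration, not part of any XPath API.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book id="1"><title>To Kill a Mockingbird</title></book>
  <book id="2"><title>A Brief History of Time</title></book>
  <book id="3"><title>1984</title></book>
</library>
"""
root = ET.fromstring(XML)

# Map every element to its parent so upward axes can be emulated.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """ancestor::* (nearest ancestor first)."""
    out = []
    while node in parent_of:
        node = parent_of[node]
        out.append(node)
    return out

def following_siblings(node):
    """following-sibling::* (later children of the same parent)."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]

def preceding_siblings(node):
    """preceding-sibling::* (earlier children of the same parent)."""
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]

title = root.find("book[2]/title")  # context node for the examples
book2 = parent_of[title]            # parent::*

print([a.tag for a in ancestors(title)])
print([b.get("id") for b in following_siblings(book2)])
print([b.get("id") for b in preceding_siblings(book2)])
```

Every axis is evaluated relative to the context node, which is why the same expression can return different nodes depending on where evaluation starts.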
XPath Functions
String Functions
<!-- Text manipulation -->
string(/library/book[1]/title) <!-- Convert to string -->
concat(author, ' - ', title) <!-- Concatenate strings -->
substring(title, 1, 5) <!-- First 5 characters -->
string-length(title) <!-- Length of title -->
normalize-space(title) <!-- Remove extra whitespace -->
<!-- String testing -->
contains(title, 'Time') <!-- Contains substring -->
starts-with(author, 'Stephen') <!-- Starts with string -->
ends-with(title, 'Time') <!-- Ends with string (XPath 2.0+) -->
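For readers more used to a general-purpose language, the sketch below pairs each string function with a rough Python equivalent. The input strings are made up for illustration, and the equivalents are approximations: for example, Python's split() treats more characters as whitespace than normalize-space() does.

```python
title = "  A Brief   History of Time "
author = "Stephen Hawking"

# normalize-space(title): trim the ends and collapse whitespace runs
normalized = " ".join(title.split())

# concat(author, ' - ', title)
joined = author + " - " + normalized

# substring(title, 1, 5): XPath counts characters from 1, Python from 0
first_five = normalized[0:5]

# string-length(title)
length = len(normalized)

# contains(), starts-with(), ends-with()
has_time = "Time" in normalized
starts = author.startswith("Stephen")
ends = normalized.endswith("Time")

print(repr(normalized), repr(first_five), length)
print(joined, has_time, starts, ends)
```

The 1-based indexing of substring() is the detail most likely to cause off-by-one errors when porting logic between XPath and other languages.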
Numeric Functions
count(/library/book) <!-- Count book elements -->
sum(/library/book/price) <!-- Sum of all prices -->
avg(/library/book/price) <!-- Average price (XPath 2.0+) -->
max(/library/book/price) <!-- Maximum price (XPath 2.0+) -->
min(/library/book/price) <!-- Minimum price (XPath 2.0+) -->
round(price) <!-- Round to nearest integer -->
ceiling(price) <!-- Round up -->
floor(price) <!-- Round down -->
Boolean Functions
boolean(price) <!-- True if price exists -->
not(@genre='fiction') <!-- Logical NOT -->
true() <!-- Always true -->
false() <!-- Always false -->
Node Functions
position() <!-- Position of current node -->
last() <!-- Context size, i.e. the position of the last node -->
name() <!-- Name of current node -->
local-name() <!-- Local name (without namespace) -->
namespace-uri() <!-- Namespace URI -->
Advanced Expressions
Union Operator
Select multiple node sets:
/library/book/title | /library/book/author <!-- All titles and authors -->
//title | //author <!-- All titles and authors anywhere -->
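One detail of the union operator is easy to miss: each node appears once, and processors generally return the result in document order (XPath 2.0+ requires this), not as all titles followed by all authors. The sketch below emulates //title | //author with ElementTree by walking the tree in document order; it is an illustration of the result, not of how a processor implements unions.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book><title>To Kill a Mockingbird</title><author>Harper Lee</author></book>
  <book><title>A Brief History of Time</title><author>Stephen Hawking</author></book>
  <book><title>1984</title><author>George Orwell</author></book>
</library>
"""
root = ET.fromstring(XML)

# root.iter() visits elements in document order, so filtering by tag
# gives the same sequence a union of the two node sets would.
wanted = {"title", "author"}
union = [el for el in root.iter() if el.tag in wanted]

for el in union:
    print(el.tag, "->", el.text)
```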
Complex Predicates
<!-- Books published after 1950 -->
/library/book[number(substring(publishDate, 1, 4)) > 1950]
<!-- Books with titles longer than 10 characters -->
/library/book[string-length(title) > 10]
<!-- Books where price is above average (avg() requires XPath 2.0+) -->
/library/book[price > avg(//price)]
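To check what these three predicates select from the sample document, the following sketch recomputes them in plain Python on top of ElementTree, whose XPath subset does not include these functions. The ids helper is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book id="1"><title>To Kill a Mockingbird</title><price>12.99</price><publishDate>1960-07-11</publishDate></book>
  <book id="2"><title>A Brief History of Time</title><price>15.99</price><publishDate>1988-04-01</publishDate></book>
  <book id="3"><title>1984</title><price>13.99</price><publishDate>1949-06-08</publishDate></book>
</library>
"""
books = ET.fromstring(XML).findall("book")

def ids(selected):
    """Hypothetical helper: list the id attribute of each book."""
    return [b.get("id") for b in selected]

# [number(substring(publishDate, 1, 4)) > 1950]
after_1950 = [b for b in books if int(b.findtext("publishDate")[:4]) > 1950]

# [string-length(title) > 10]
long_titles = [b for b in books if len(b.findtext("title")) > 10]

# [price > avg(//price)]
prices = [float(b.findtext("price")) for b in books]
average = sum(prices) / len(prices)
above_average = [b for b in books if float(b.findtext("price")) > average]

print(ids(after_1950), ids(long_titles), round(average, 2), ids(above_average))
```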
Conditional Logic
<!-- XPath 2.0+ conditional expression -->
if (price > 15) then 'expensive' else 'affordable'
<!-- Using predicates for conditions (a string literal as the final step also requires XPath 2.0+) -->
/library/book[price > 15]/'expensive'
/library/book[price <= 15]/'affordable'
Namespaces in XPath
When working with namespaced XML:
<lib:library xmlns:lib="http://example.com/library">
  <lib:book lib:id="1">
    <lib:title>Sample Book</lib:title>
  </lib:book>
</lib:library>
Register namespaces in your XPath processor:
<!-- After registering lib prefix -->
/lib:library/lib:book
/lib:library/lib:book/@lib:id
//lib:title
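As a minimal sketch of prefix registration, the example below queries the namespaced document with Python's standard-library ElementTree, passing a prefix-to-URI mapping to findall. It also shows that the prefix is only a local alias: what has to match is the namespace URI, so a different prefix (the x alias here is made up) finds the same elements.

```python
import xml.etree.ElementTree as ET

XML = """
<lib:library xmlns:lib="http://example.com/library">
  <lib:book lib:id="1">
    <lib:title>Sample Book</lib:title>
  </lib:book>
</lib:library>
"""
root = ET.fromstring(XML)

# "Registering" the prefix: map it to the namespace URI.
ns = {"lib": "http://example.com/library"}

books = root.findall("lib:book", ns)                         # /lib:library/lib:book
titles = [t.text for t in root.findall(".//lib:title", ns)]  # //lib:title

# ElementTree stores namespaced names in {uri}local form.
book_id = books[0].get("{http://example.com/library}id")     # @lib:id

# Any prefix works as long as it maps to the same URI.
alias = {"x": "http://example.com/library"}
same_books = root.findall("x:book", alias)

print(root.tag, titles, book_id, same_books == books)
```

Querying for book without a prefix would return nothing here, because an unprefixed name in the path refers to an element in no namespace.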
XPath in Different Contexts
JavaScript (DOM)
// Using XPath in the browser
const xpath = '//book[@genre="fiction"]/title';
const result = document.evaluate(
  xpath,
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);
for (let i = 0; i < result.snapshotLength; i++) {
  const node = result.snapshotItem(i);
  console.log(node.textContent);
}
Python (lxml)
from lxml import etree
# Parse XML
doc = etree.parse('library.xml')
# XPath queries
titles = doc.xpath('//book[@genre="fiction"]/title/text()')
prices = doc.xpath('//price[. > 14]/text()')
books = doc.xpath('//book[contains(title, "Time")]')
# With namespaces
namespaces = {'lib': 'http://example.com/library'}
books = doc.xpath('//lib:book', namespaces=namespaces)
Java (javax.xml.xpath)
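import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xpath = xPathFactory.newXPath();
// Simple query (document is an already parsed org.w3c.dom.Document)
String expression = "//book[@genre='fiction']/title";
NodeList nodes = (NodeList) xpath.evaluate(
    expression, document, XPathConstants.NODESET
);
// With custom namespace context: map only the "lib" prefix
final String libNs = "http://example.com/library";
xpath.setNamespaceContext(new NamespaceContext() {
    public String getNamespaceURI(String prefix) {
        return "lib".equals(prefix) ? libNs : XMLConstants.NULL_NS_URI;
    }
    public String getPrefix(String namespaceURI) {
        return libNs.equals(namespaceURI) ? "lib" : null;
    }
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null ? Collections.<String>emptyIterator() : Collections.singletonList(prefix).iterator();
    }
});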
XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <!-- XPath in select attributes -->
    <xsl:for-each select="//book[@genre='fiction']">
      <xsl:value-of select="title"/>
    </xsl:for-each>
    <!-- XPath in test attributes -->
    <xsl:if test="count(//book) > 2">
      <p>Many books available</p>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
Performance Considerations
Efficient XPath Expressions
<!-- Efficient: Specific path -->
/library/book[@id='1']
<!-- Less efficient: Descendant search -->
//book[@id='1']
<!-- Equivalent: [1] is shorthand for [position()=1] -->
/library/book[1]
/library/book[position()=1]
Indexing Strategies
- Use specific paths when possible
- Avoid deep descendant searches (//) when unnecessary
- Use predicates early to filter results
- Consider the document structure when designing XPath
XPath 2.0 and 3.0 Enhancements
Sequence Operations (XPath 2.0+)
(1 to 5) <!-- Sequence: 1, 2, 3, 4, 5 -->
//book[position() = (1 to 3)] <!-- First three books -->
distinct-values(//author) <!-- Unique author names -->
Enhanced Functions (XPath 2.0+)
matches(title, '[0-9]+') <!-- Regular expression matching -->
replace(title, 'Time', 'Era') <!-- String replacement -->
tokenize(author, ' ') <!-- Split string into sequence -->
format-number(price, '#.00') <!-- Number formatting (an XSLT function; part of XPath itself only from 3.0) -->
Date/Time Functions (XPath 2.0+)
current-date() <!-- Current date -->
year-from-date(publishDate) <!-- Extract year from date -->
format-date(publishDate, '[Y0001]-[M01]-[D01]') <!-- Format date (XSLT 2.0; part of XPath itself only from 3.0) -->
Common Patterns and Examples
Finding Specific Content
<!-- Books by specific author -->
//book[author='Stephen Hawking']
<!-- Books in price range -->
//book[price >= 12 and price <= 15]
<!-- Books published in specific decade -->
//book[substring(publishDate, 1, 3) = '196']
Aggregation and Calculation
<!-- Total number of books -->
count(//book)
<!-- Average price -->
sum(//price) div count(//price)
<!-- Most expensive book (max() requires XPath 2.0+) -->
//book[price = max(//price)]
<!-- Books above average price -->
//book[price > (sum(//price) div count(//price))]
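The arithmetic behind these expressions can be checked by hand against the sample prices (12.99, 15.99, 13.99). The sketch below does that in Python on a trimmed copy of the sample document. For the most expensive book it mirrors a commonly used XPath 1.0 idiom, //book[not(//price > price)], which avoids max(); treat that idiom as an alternative to consider, not as part of the original examples.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book><title>To Kill a Mockingbird</title><price>12.99</price></book>
  <book><title>A Brief History of Time</title><price>15.99</price></book>
  <book><title>1984</title><price>13.99</price></book>
</library>
"""
books = ET.fromstring(XML).findall("book")
prices = [float(b.findtext("price")) for b in books]

count = len(books)        # count(//book)
total = sum(prices)       # sum(//price)
average = total / count   # sum(//price) div count(//price)

# //book[not(//price > price)]: keep a book when no price exceeds its own.
most_expensive = [
    b.findtext("title")
    for b in books
    if not any(p > float(b.findtext("price")) for p in prices)
]

print(count, round(total, 2), round(average, 2), most_expensive)
```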
Text Processing
<!-- Books with single-word titles -->
//book[not(contains(title, ' '))]
<!-- Authors with last name starting with specific letter -->
//book[starts-with(substring-after(author, ' '), 'H')]
<!-- Normalize and compare text -->
//book[normalize-space(title) = '1984']
Debugging XPath Expressions
Testing Strategies
- Start simple: Begin with basic paths and add complexity
- Use browser tools: Most browser developer consoles can evaluate XPath directly, for example with $x('//book')
- Test incrementally: Build expressions step by step
- Validate context: Ensure you understand the current node context
Common Mistakes
<!-- Wrong: Missing root slash -->
library/book <!-- Relative path -->
<!-- Correct: Absolute path -->
/library/book
<!-- Risky: an unquoted value is compared as a number -->
/library/book[@id=1] <!-- Valid, but also matches id="01" or id="1.0" -->
<!-- Safer: quote the value to compare it as a string -->
/library/book[@id='1']
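The difference between [@id=1] and [@id='1'] is that the first converts the attribute to a number before comparing, while the second compares strings. The sketch below imitates both comparisons in Python over a few made-up id values. The xpath_number helper is only a rough stand-in for XPath 1.0's number() function (Python's float() accepts some forms that XPath does not).

```python
def xpath_number(text):
    """Rough stand-in for XPath 1.0 number(): NaN when the text is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return float("nan")

# Hypothetical id attribute values found in a document.
id_values = ["1", "01", "1.0", "one"]

# [@id=1]: the attribute value is converted to a number, then compared.
numeric_matches = [v for v in id_values if xpath_number(v) == 1]

# [@id='1']: the attribute value is compared as a string.
string_matches = [v for v in id_values if v == "1"]

print(numeric_matches)
print(string_matches)
```

When ids are identifiers and not quantities, the quoted form is usually what you want, since it avoids these accidental matches.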
Best Practices
Expression Design
- Be specific: Use precise paths when possible
- Use meaningful predicates: Filter early and effectively
- Consider performance: Avoid unnecessary descendant axes
- Document complex expressions: Add comments for maintainability
Maintainability
<!-- Good: Self-documenting -->
/library/book[@genre='fiction'][price < 15]
<!-- Better: With context -->
//book[@genre='fiction' and price < 15]/title
Conclusion
XPath is an essential skill for working with XML documents. Its powerful selection and filtering capabilities make it indispensable for XML processing, transformation, and querying. Mastering XPath syntax and functions will significantly enhance your ability to work with XML data effectively.
Next Steps
- Learn XSLT to transform XML using XPath
- Explore XQuery for more complex querying
- Study XML Processing to use XPath in applications