
Navigating a Document with XPath Patterns

XPath expressions are usually built out of patterns, which describe a branch of an XML tree. A pattern is therefore used to reference one or more nodes in the hierarchy of a document tree. Patterns can be constructed to perform relatively complex matching tasks, and together they form a mini query language for locating specific nodes in a document. Patterns can isolate specific nodes or groups of nodes and can be specified as absolute or relative. An absolute pattern spells out the exact location of a node or node set, whereas a relative pattern identifies a node or node set relative to a certain context.

The next few sections examine the ways in which patterns are used to access nodes within XML documents. To better understand how patterns are used, it's worth seeing them in the context of a real XML document. Listing 22.1 contains the code for the familiar training log sample document, which serves as the sample document in this tutorial for XPath.

Listing 22.1. The Training Log Sample XML Document
01: <?xml version="1.0"?>
02: <!DOCTYPE trainlog SYSTEM "etml.dtd">
03:
04: <trainlog>
05:   <!-- This session was part of the marathon training group run. -->
06:   <session date="11/19/05" type="running" heartrate="158">
07:     <duration units="minutes">45</duration>
08:     <distance units="miles">5.5</distance>
09:     <location>Warner Park</location>
10:     <comments>Mid-morning run, a little winded throughout.</comments>
11:   </session>
12:
13:   <session date="11/21/05" type="cycling" heartrate="153">
14:     <duration units="hours">2.5</duration>
15:     <distance units="miles">37.0</distance>
16:     <location>Natchez Trace Parkway</location>
17:     <comments>Hilly ride, felt strong as an ox.</comments>
18:   </session>
19:
20:   <session date="11/24/05" type="running" heartrate="156">
21:     <duration units="hours">1.5</duration>
22:     <distance units="miles">8.5</distance>
23:     <location>Warner Park</location>
24:     <comments>Afternoon run, felt reasonably strong.</comments>
25:   </session>
26: </trainlog>

You may want to keep a bookmark around for this page, as several of the XPath examples throughout the next section rely on the training log sample code.

Referencing Nodes

The most basic of all XPath patterns is the pattern that references the current node, which consists of a simple period:

.

If you're traversing a document tree, a period will obtain the current node. The current node pattern is therefore a relative pattern because it makes sense only in the context of a tree of data. As a contrast to the current pattern, which is relative, consider the pattern that is used to select the root node of a document. This pattern is known as the root pattern and consists of a single forward slash:

/

If you were to use a single forward slash in an expression for the training log sample document, it would refer to the root node of the document, which sits one level above the trainlog element (line 4); trainlog is the root element and is the root node's only element child. Because the root pattern directly references a specific location in a document (the root node), it is considered an absolute pattern. The root pattern is extremely important to XPath because it represents the starting point of any document's node tree.

As you know, XPath relies on the hierarchical nature of XML documents to reference nodes. The relationship between nodes in this type of hierarchy is best described as a familial relationship, which means that nodes can be described as parent, child, or sibling nodes, depending upon the context of the tree. For example, the root node is an ancestor of every other node. Nodes might be parents of some nodes and siblings of others. To reference child nodes using XPath, you use the name of the child node as the pattern. So, in the training log example, you can reference a session element (line 6, for example) as a child of the trainlog element by simply specifying the name of the element: session. Of course, this assumes that the trainlog element (line 4) is the current context for the pattern, in which case a relative child path is okay. If the trainlog element isn't the current context, you should fully specify the child path as /trainlog/session. Notice in this case that the root pattern is combined with child patterns to create an absolute path.

I've mentioned context a few times in regard to node references. Context simply refers to the location within a document tree from which you are referencing a node. The context is established by the current node you are referencing. All further references are then made with respect to this node.
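
To see context at work, here is a minimal sketch in Python that uses the standard library's xml.etree.ElementTree module on a trimmed copy of the training log. Both the choice of Python and the trimmed document are assumptions made for illustration. ElementTree implements only a subset of XPath and rejects absolute paths (those beginning with /) on an element, so the sketch sticks to relative patterns.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <duration units="minutes">45</duration>
    <distance units="miles">5.5</distance>
  </session>
  <session date="11/21/05" type="cycling" heartrate="153">
    <duration units="hours">2.5</duration>
    <distance units="miles">37.0</distance>
  </session>
  <session date="11/24/05" type="running" heartrate="156">
    <duration units="hours">1.5</duration>
    <distance units="miles">8.5</distance>
  </session>
</trainlog>"""

# The parsed trainlog element is the context for the first two patterns.
trainlog = ET.fromstring(LOG)

# "." refers back to the context node itself.
current = trainlog.find(".")

# A bare element name selects the children of the context that have that name.
sessions = trainlog.findall("session")

# Moving the context to a session changes what a relative pattern can reach.
first_session = sessions[0]
distance = first_session.find("distance")

print(current.tag)                         # trainlog
print([s.get("date") for s in sessions])   # ['11/19/05', '11/21/05', '11/24/05']
print(distance.text)                       # 5.5
```

The same pattern, distance, finds nothing when evaluated against trainlog directly, because distance is not one of its children; that is the practical meaning of context.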

If there are child nodes, there must also be parent nodes. To access a parent node, you use two periods:

..

As an example, if the current context is one of the distance elements (line 15, for example) in the training log document, the .. parent pattern will reference the parent of the node, which is a session element (line 13). You can put patterns together to get more interesting results. For example, to address a sibling node, you must first go to the parent and then reference the sibling as a child. In other words, you use the parent pattern (..) followed by a forward slash (/) followed by the sibling node name, like this:

../duration

This pattern assumes that the context is one of the child elements of the session element (other than duration). Assuming this context, the ../duration pattern will reference the duration element (line 14) as a sibling node.
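
The parent and sibling patterns can be tried out with a short Python sketch, again assuming the standard library's xml.etree.ElementTree module and a trimmed copy of the training log. One caveat is worth stating up front: ElementTree elements don't keep a pointer to their parent, so .. only works inside a path that starts from an ancestor. The sketch therefore walks down from trainlog to a distance element and then back up, rather than starting at the distance element itself.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <duration units="minutes">45</duration>
    <distance units="miles">5.5</distance>
  </session>
  <session date="11/21/05" type="cycling" heartrate="153">
    <duration units="hours">2.5</duration>
    <distance units="miles">37.0</distance>
  </session>
</trainlog>"""

trainlog = ET.fromstring(LOG)

# Step down to the second session's distance element, then back up with "..".
parent = trainlog.find("session[2]/distance/..")

# "../duration" reaches a sibling: up to the parent, then down to its duration child.
sibling = trainlog.find("session[2]/distance/../duration")

print(parent.tag, parent.get("date"))   # session 11/21/05
print(sibling.tag, sibling.text)        # duration 2.5
```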

Thus far I've focused on referencing individual nodes. However, it's also possible to select multiple nodes. For example, you can select all of the descendants of a given node (its children, their children, and so on) using the double slash pattern:

//

As an example, if the context is one of the session elements in the training log document (line 20, for example), you can select all of its descendant elements by using double slashes (for instance, .//*). This results in the duration (line 21), distance (line 22), location (line 23), and comments (line 24) elements being selected.
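
As a rough illustration of the double slash, the following Python sketch uses the standard library's xml.etree.ElementTree module (an assumption; any XPath-capable tool would do) on a trimmed copy of the training log. It selects every descendant element of the last session, and then every distance element at any depth below trainlog.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <duration units="minutes">45</duration>
    <distance units="miles">5.5</distance>
  </session>
  <session date="11/24/05" type="running" heartrate="156">
    <duration units="hours">1.5</duration>
    <distance units="miles">8.5</distance>
    <location>Warner Park</location>
    <comments>Afternoon run, felt reasonably strong.</comments>
  </session>
</trainlog>"""

trainlog = ET.fromstring(LOG)

# Use the last session as the context, then select all of its descendant elements.
last_session = trainlog.findall("session")[-1]
descendants = last_session.findall(".//*")

# From trainlog, ".//distance" finds distance elements at any depth.
all_distances = trainlog.findall(".//distance")

print([e.tag for e in descendants])      # ['duration', 'distance', 'location', 'comments']
print([d.text for d in all_distances])   # ['5.5', '8.5']
```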

Another way to select multiple nodes is to use the wildcard pattern, which is an asterisk:

*

The wildcard pattern selects all of the child elements of the context node. So, if the context was the trainlog element and you used the pattern */distance, all of the distance elements in the document would be selected. This occurs because the wildcard pattern first results in all of the session elements being selected, after which the selection is limited to their child distance elements.
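
The effect of context on the wildcard is easy to check with a small Python sketch, assuming the standard library's xml.etree.ElementTree module and a trimmed copy of the training log. The pattern */distance is evaluated twice: once with the trainlog element as the context, and once with a session element as the context.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <duration units="minutes">45</duration>
    <distance units="miles">5.5</distance>
  </session>
  <session date="11/21/05" type="cycling" heartrate="153">
    <duration units="hours">2.5</duration>
    <distance units="miles">37.0</distance>
  </session>
  <session date="11/24/05" type="running" heartrate="156">
    <duration units="hours">1.5</duration>
    <distance units="miles">8.5</distance>
  </session>
</trainlog>"""

trainlog = ET.fromstring(LOG)
first_session = trainlog.find("session")

# Context is trainlog: "*" matches each session, then "distance" matches its child.
from_trainlog = trainlog.findall("*/distance")

# Context is a session: "*" matches duration and distance, neither of which
# has a distance child, so nothing is selected.
from_session = first_session.findall("*/distance")

# On its own, "*" selects every child element of the context.
children = first_session.findall("*")

print([d.text for d in from_trainlog])   # ['5.5', '37.0', '8.5']
print(from_session)                      # []
print([c.tag for c in children])         # ['duration', 'distance']
```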

To summarize, following are the primary building blocks used to reference nodes in XPath:

  • Current node: .

  • Root node: /

  • Parent node: ..

  • Child node: Child

  • Sibling node: ../Sibling

  • All descendant nodes: //

  • All child elements: *

These pattern building blocks form the core of XPath, but they don't tell the whole story. The next section explores attributes and subsets and how they are referenced.

Referencing Attributes and Subsets

Elements aren't the only important pieces of information in XML documents; it's also important to be able to reference attributes. Fortunately, XPath makes it quite easy to reference attributes by using the "at" symbol:

@

The at symbol is used to reference attributes by preceding an attribute name:

*/distance/@units

This code selects all of the units attributes for distance elements in the training log document, assuming that the context is the trainlog element. As you can see, attributes fit right into the path notation used by XPath and are referenced in the same manner as elements, with the addition of the at (@) symbol.
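
Here is a hedged Python sketch of the same idea, assuming the standard library's xml.etree.ElementTree module and a trimmed copy of the training log. ElementTree's XPath subset can't return attribute nodes, so a path ending in @units isn't available there; instead, the sketch selects the distance elements and reads the attribute from each one. It also shows @ inside square brackets, which ElementTree does support, to keep only the elements that carry a units attribute.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <duration units="minutes">45</duration>
    <distance units="miles">5.5</distance>
    <location>Warner Park</location>
  </session>
  <session date="11/21/05" type="cycling" heartrate="153">
    <duration units="hours">2.5</duration>
    <distance units="miles">37.0</distance>
    <location>Natchez Trace Parkway</location>
  </session>
</trainlog>"""

trainlog = ET.fromstring(LOG)

# Stand-in for */distance/@units: select the elements, then read the attribute.
units = [d.get("units") for d in trainlog.findall("*/distance")]

# "[@units]" keeps only the grandchildren of trainlog that have a units attribute,
# which leaves out the location elements.
with_units = trainlog.findall("*/*[@units]")

print(units)                          # ['miles', 'miles']
print([e.tag for e in with_units])    # ['duration', 'distance', 'duration', 'distance']
```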

One other important feature of XPath expressions is support for the selection of subsets of nodes. You select a subset by appending square brackets ([]) to the end of a pattern and then placing an expression within the brackets that defines the subset. As an example, consider the following pattern, which selects all the session elements in the training log document when the context is the root node of the document:

*/session

It's possible that you might want to limit the session elements to a certain type of training session, such as running. To do this, you add square brackets onto the pattern, and you create an expression that checks to see if the session type is set to running:

*/session[@type='running']

This pattern results in selecting only the session elements whose type attribute is set to running. Notice that an at symbol (@) is used in front of the attribute name (type) to indicate that it is an attribute. You can also address elements by index, as the following expression demonstrates:

/trainlog/session[1]

This expression selects the first session element in the document.
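
Both kinds of subset can be sketched in Python, assuming the standard library's xml.etree.ElementTree module and a trimmed copy of the training log. Because ElementTree doesn't accept absolute paths on an element, the patterns are written relative to the trainlog element. Note that XPath positions start at 1, not 0.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training log (an assumption made to keep the sketch short).
LOG = """<trainlog>
  <session date="11/19/05" type="running" heartrate="158">
    <distance units="miles">5.5</distance>
  </session>
  <session date="11/21/05" type="cycling" heartrate="153">
    <distance units="miles">37.0</distance>
  </session>
  <session date="11/24/05" type="running" heartrate="156">
    <distance units="miles">8.5</distance>
  </session>
</trainlog>"""

trainlog = ET.fromstring(LOG)

# Attribute test: keep only the sessions whose type attribute is "running".
running = trainlog.findall("session[@type='running']")

# Index test: the first session child of trainlog (positions are 1-based).
first = trainlog.find("session[1]")

print([s.get("date") for s in running])   # ['11/19/05', '11/24/05']
print(first.get("date"))                  # 11/19/05
```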