XML Data Querying 101

XQuery is one of those technologies that is best understood by jumping in and experimenting with it. So, let's hit the ground running and look at how XQuery is used to query an XML document. The sample XML document shown in Listing 18.1, a trimmed-down version of a document included in earlier chapters, is used in the examples that follow.

Listing 18.1. A Sample XML Document Containing Vehicle Data
<?xml version="1.0"?>

<vehicles>
  <vehicle year="2004" make="Acura" model="3.2TL">
    <mileage>13495</mileage>
    <color>green</color>
    <price>33900</price>
    <options>
      <option>navigation system</option>
      <option>heated seats</option>
    </options>
  </vehicle>

  <vehicle year="2005" make="Acura" model="3.2TL">
    <mileage>07541</mileage>
    <color>white</color>
    <price>33900</price>
    <options>
      <option>spoiler</option>
      <option>ground effects</option>
    </options>
  </vehicle>

  <vehicle year="2004" make="Acura" model="3.2TL">
    <mileage>18753</mileage>
    <color>white</color>
    <price>32900</price>
    <options />
  </vehicle>
</vehicles>

Now let's take a look at some simple XQuery queries that can be used to retrieve data from that document. The syntax for XQuery is very lean, and in fact borrows heavily from a related technology called XPath; you learn a great deal more about XPath in "Addressing and Linking XML Documents." As an example, the query that retrieves all of the color elements from the document is:

for $c in //color
return $c

This query returns the following:

<?xml version="1.0" encoding="UTF-8"?>
<color>green</color>
<color>white</color>
<color>white</color>

The queries are intended to be typed into an application that supports XQuery, or to be used within XQuery queries that are passed into an XQuery processor. The results of the query are displayed afterward, to show what would be returned.

This query asks for all of the elements named color in the document. The // operator is used to return elements anywhere below another element, which in this case indicates that all color elements in the document should be returned. You could have just as easily coded this example as:

for $c in vehicles/vehicle/color
return $c

The $c in these examples serves as a variable, or placeholder, that holds the results of the query. You can think of the query results as a loop where each matching element is grabbed one after the next. In this case, all you're doing is returning the results for further processing or for writing to an XML document.
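The two path forms and the loop-like role of $c can be tried outside an XQuery processor. The following is only a rough sketch, assuming Python's standard xml.etree.ElementTree module, which implements a small subset of XPath and not XQuery itself. The variable names and the trimmed data are illustrative; ".//color" stands in for //color because ElementTree paths are written relative to an element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 18.1: only the color elements are kept.
VEHICLES_XML = """<vehicles>
  <vehicle year="2004" make="Acura" model="3.2TL"><color>green</color></vehicle>
  <vehicle year="2005" make="Acura" model="3.2TL"><color>white</color></vehicle>
  <vehicle year="2004" make="Acura" model="3.2TL"><color>white</color></vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)  # root is the <vehicles> element

# ".//color" stands in for //color: color elements at any depth.
anywhere = []
for c in root.findall(".//color"):  # c plays the role of $c
    anywhere.append(c.text)

# "vehicle/color" is the explicit path, written relative to <vehicles>.
explicit = [c.text for c in root.findall("vehicle/color")]

print(anywhere)  # ['green', 'white', 'white']
print(explicit)  # ['green', 'white', 'white']
```

Both paths visit the same three elements in document order, which mirrors the way $c is bound to one matching element after the next.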

If you're familiar with the for loop in a programming language such as BASIC, Java, or C++, the for construct in XQuery won't be entirely foreign, even if it doesn't involve setting up a counter as in traditional for loops.

As the previous code reveals, a / separates the levels of the document hierarchy, much like the separator between folders in a file path, and a / at the very beginning of a query indicates the root level of the document. For example, the query that follows wouldn't return anything because color is not the root-level element of the document.

/color

All of this node addressing syntax is technically part of XPath, which makes up a considerable part of the XQuery technology. You learn a great deal more about the ins and outs of XPath in "Addressing and Linking XML Documents." As you can see, aside from a few wrinkles, requesting elements from an XML document using XQuery/XPath isn't all that different from locating files in a file system using a command shell.

In XQuery/XPath, expressions within square brackets ([]) are subqueries. Those expressions are not used to retrieve elements themselves but to qualify the elements that are retrieved. For example, a query such as

//vehicle/color

retrieves color elements that are children of vehicle elements. On the other hand, this query

//vehicle[color]

retrieves vehicle elements that have a color element as a child. Subqueries are particularly useful when you use them with filters to write very specific queries.
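The difference between selecting a child and qualifying a parent is easier to see when one vehicle lacks the child. Here is a minimal sketch, again assuming Python's xml.etree.ElementTree rather than an XQuery processor; the third vehicle (year 2006, no color) is hypothetical and is not part of Listing 18.1.

```python
import xml.etree.ElementTree as ET

# Two vehicles from Listing 18.1 (trimmed) plus a HYPOTHETICAL third one
# with no color child, added only to make the difference visible.
VEHICLES_XML = """<vehicles>
  <vehicle year="2004"><color>green</color></vehicle>
  <vehicle year="2005"><color>white</color></vehicle>
  <vehicle year="2006"><price>30000</price></vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)

# Path step: the color elements themselves are returned.
children = [e.tag for e in root.findall(".//vehicle/color")]

# Subquery in brackets: vehicle elements that have a color child are returned.
parents = [(e.tag, e.get("year")) for e in root.findall(".//vehicle[color]")]

print(children)  # ['color', 'color']
print(parents)   # [('vehicle', '2004'), ('vehicle', '2005')]
```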

Querying with Wildcards

Continuing along with the vehicle code example, let's say you want to find all of the option elements that are grandchildren of the vehicle element. To get them all from the sample document, you could just use the query vehicles/vehicle/options/option. However, let's say that you didn't know that the intervening element was options or that there were other elements that could intervene between vehicle and option. In that case, you could use the following query:

for $o in vehicles/vehicle/*/option
return $o

Following are the results of this query:

<?xml version="1.0" encoding="UTF-8"?>
<option>navigation system</option>
<option>heated seats</option>
<option>spoiler</option>
<option>ground effects</option>

The wildcard (*) matches any element. You can also use it at the end of a query to match all the children of a particular element.
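As a rough illustration of both uses of the wildcard, the sketch below runs the same path against a copy of Listing 18.1. It assumes Python's xml.etree.ElementTree, whose limited XPath support happens to include the * step; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Copy of Listing 18.1 without the XML declaration.
VEHICLES_XML = """<vehicles>
  <vehicle year="2004" make="Acura" model="3.2TL">
    <mileage>13495</mileage>
    <color>green</color>
    <price>33900</price>
    <options>
      <option>navigation system</option>
      <option>heated seats</option>
    </options>
  </vehicle>
  <vehicle year="2005" make="Acura" model="3.2TL">
    <mileage>07541</mileage>
    <color>white</color>
    <price>33900</price>
    <options>
      <option>spoiler</option>
      <option>ground effects</option>
    </options>
  </vehicle>
  <vehicle year="2004" make="Acura" model="3.2TL">
    <mileage>18753</mileage>
    <color>white</color>
    <price>32900</price>
    <options />
  </vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)

# Wildcard in the middle: any element may sit between vehicle and option.
option_texts = [o.text for o in root.findall("vehicle/*/option")]

# Wildcard at the end: all children of the first vehicle element.
first_vehicle_children = [e.tag for e in root.find("vehicle").findall("*")]

print(option_texts)
print(first_vehicle_children)  # ['mileage', 'color', 'price', 'options']
```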

Using Filters to Search for Specific Information

After you've mastered the extraction of specific elements from XML files, you can move on to searching for elements that contain information you specify. Let's say you want to find higher-level elements containing a particular value in a child element. The [] operator indicates that the expression within the square brackets should be searched but that the element listed to the left of the square brackets should be returned. For example, the following expression would read "return any vehicle elements that contain a color element with a value of green":

for $v in //vehicle[color='green']
return $v

Here are the results:

<?xml version="1.0" encoding="UTF-8"?>
<vehicle year="2004" make="Acura" model="3.2TL">
  <mileage>13495</mileage>
  <color>green</color>
  <price>33900</price>
  <options>
    <option>navigation system</option>
    <option>heated seats</option>
  </options>
</vehicle>
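The same filter can be approximated outside XQuery. This is a small sketch assuming Python's xml.etree.ElementTree, which accepts the [color='green'] form of filter; the data is a trimmed copy of Listing 18.1 with the options left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 18.1 (options omitted for brevity).
VEHICLES_XML = """<vehicles>
  <vehicle year="2004"><mileage>13495</mileage><color>green</color></vehicle>
  <vehicle year="2005"><mileage>07541</mileage><color>white</color></vehicle>
  <vehicle year="2004"><mileage>18753</mileage><color>white</color></vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)

# The filter tests the color child, but whole vehicle elements come back.
green = root.findall(".//vehicle[color='green']")

summary = [(v.tag, v.get("year"), v.findtext("mileage")) for v in green]
print(summary)  # [('vehicle', '2004', '13495')]
```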

The full vehicle element is returned because it appears to the left of the search expression enclosed in the square brackets. You can also use Boolean operators such as and and or to string multiple search expressions together. For example, to find all of the vehicles with a color of green or a price less than 33000, you would use the following query (the limit is written as a number rather than a quoted string so that prices are compared numerically, not character by character):

for $v in //vehicle[color='green' or price < 33000]
return $v

This query results in the following:

<?xml version="1.0" encoding="UTF-8"?>
<vehicle year="2004" make="Acura" model="3.2TL">
  <mileage>13495</mileage>
  <color>green</color>
  <price>33900</price>
  <options>
    <option>navigation system</option>
    <option>heated seats</option>
  </options>
</vehicle>
<vehicle year="2004" make="Acura" model="3.2TL">
  <mileage>18753</mileage>
  <color>white</color>
  <price>32900</price>
  <options/>
</vehicle>

The != operator is also available when you want to write expressions to test for inequality. In addition to and and or, there is a third Boolean construct, not, which is written as a function: not(...). You can combine all three to write complex queries, such as this:

for $v in //vehicle[not(color='blue' or color='green') and @year='2004']
return $v

This example is a little more interesting in that it looks for vehicles that aren't blue or green but that are in the model year 2004. Following are the results:

<?xml version="1.0" encoding="UTF-8"?>
<vehicle year="2004" make="Acura" model="3.2TL">
  <mileage>18753</mileage>
  <color>white</color>
  <price>32900</price>
  <options/>
</vehicle>
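To check the logic of the combined condition by hand, it can be spelled out in ordinary code. The sketch below assumes Python's xml.etree.ElementTree, which has no and, or, or not() inside its filters, so the Boolean test is written in Python instead; the helper name wanted is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 18.1 (options omitted for brevity).
VEHICLES_XML = """<vehicles>
  <vehicle year="2004"><mileage>13495</mileage><color>green</color></vehicle>
  <vehicle year="2005"><mileage>07541</mileage><color>white</color></vehicle>
  <vehicle year="2004"><mileage>18753</mileage><color>white</color></vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)


def wanted(vehicle):
    """Mirror not(color='blue' or color='green') and @year='2004'."""
    color = vehicle.findtext("color")
    return not (color == "blue" or color == "green") and vehicle.get("year") == "2004"


matches = [v.findtext("mileage") for v in root.iter("vehicle") if wanted(v)]
print(matches)  # ['18753']
```

The green 2004 vehicle fails the not() test and the white 2005 vehicle fails the year test, which leaves only the white 2004 vehicle.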

You might be wondering about the at symbol (@) in front of the year in the query. If you recall from the vehicles sample document (Listing 18.1), year is an attribute, not a child element; @ is used to reference attributes in XQuery. More on attributes in a moment.

Just to make sure you understand subqueries, what if you wanted to retrieve just the options for any white cars in the document? Here's the query:

//vehicle[color='white']/options

And here's the result:

<?xml version="1.0" encoding="UTF-8"?>
<options>
  <option>spoiler</option>
  <option>ground effects</option>
</options>
<options/>

Let's break down that query. Remember that // means "anywhere in the hierarchy." The //vehicle part indicates that you're looking for vehicle elements anywhere in the document. The [color='white'] part indicates that you're interested only in vehicle elements containing color elements with a value of white. The part you haven't yet seen is /options. This indicates that the results should be any options elements that are children of vehicle elements containing a color element with a value of white.
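The breakdown above can be followed step by step in a short sketch. It assumes Python's xml.etree.ElementTree, which allows a path to continue after a bracketed filter; the data is a trimmed copy of Listing 18.1 and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 18.1 (mileage and price omitted for brevity).
VEHICLES_XML = """<vehicles>
  <vehicle year="2004">
    <color>green</color>
    <options><option>navigation system</option><option>heated seats</option></options>
  </vehicle>
  <vehicle year="2005">
    <color>white</color>
    <options><option>spoiler</option><option>ground effects</option></options>
  </vehicle>
  <vehicle year="2004">
    <color>white</color>
    <options />
  </vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)

# Filter the vehicles first, then step down to their options children.
white_options = root.findall(".//vehicle[color='white']/options")

# One list of option texts per options element; the empty element gives [].
option_lists = [[o.text for o in opts.findall("option")] for opts in white_options]
print(option_lists)  # [['spoiler', 'ground effects'], []]
```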

Referencing Attributes

The next thing to look at is attributes. When you want to refer to an attribute, place an @ sign before its name. So, to find all the year attributes of vehicle elements, use the following query:

//vehicle/@year

You can write a slightly different query that returns all of the vehicle elements that have year attributes as well:

//vehicle[@year]

This naturally leads up to writing a query that returns all the vehicle elements that have a year attribute with a certain value, say 2005. That complete query is

for $v in //vehicle[@year="2005"]
return $v
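The three attribute queries can be approximated as follows. This is a sketch under the assumption that Python's xml.etree.ElementTree is used: it can filter on attributes with [@year] and [@year='2005'], but it cannot return attribute nodes directly, so the values are read with get() instead of a path such as //vehicle/@year.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 18.1 (only mileage kept inside each vehicle).
VEHICLES_XML = """<vehicles>
  <vehicle year="2004" make="Acura" model="3.2TL"><mileage>13495</mileage></vehicle>
  <vehicle year="2005" make="Acura" model="3.2TL"><mileage>07541</mileage></vehicle>
  <vehicle year="2004" make="Acura" model="3.2TL"><mileage>18753</mileage></vehicle>
</vehicles>"""

root = ET.fromstring(VEHICLES_XML)

# Vehicles that have a year attribute, and the values of that attribute.
with_year = root.findall(".//vehicle[@year]")
years = [v.get("year") for v in with_year]

# Vehicles whose year attribute has a specific value.
from_2005 = [v.findtext("mileage") for v in root.findall(".//vehicle[@year='2005']")]

print(years)      # ['2004', '2005', '2004']
print(from_2005)  # ['07541']
```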