Because you already know the scoop on SAX, Java, and the Xerces SAX parser for Java, let's go ahead and jump right into the program code. Here are the first 12 lines of Java code:
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
public class DocumentPrinter implements ContentHandler, ErrorHandler {
    // A constant containing the name of the SAX parser to use.
    private static final String PARSER_NAME =
        "org.apache.xerces.parsers.SAXParser";
This code imports classes that will be used later on and declares the class (program) that you're currently writing. The import statements indicate which classes will be used by this program. In this case, all of the classes that will be used are from the org.xml.sax package and are included in the xercesImpl.jar and xml-apis.jar archives.
This class, called DocumentPrinter, implements two interfaces: ContentHandler and ErrorHandler. These two interfaces are part of the standard SAX 2.0 package and are included in the import list. A program that implements ContentHandler is set up to handle events passed back in the normal course of parsing an XML document, and a program that implements ErrorHandler can handle any error events generated during SAX parsing.
In the Java world, an interface is a framework that specifies a list of methods that must be defined in a class. An interface is useful because it guarantees that any class that implements it meets the requirements of that interface. If you fail to include all of the methods required by the interface, your program will not compile. Because this program implements ContentHandler and ErrorHandler, the parser can be certain that it is capable of handling all of the events it triggers as it parses a document.
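To make that guarantee concrete, here is a minimal sketch of an interface and a class that implements it. The names EventSink, PrintingSink, and InterfaceDemo are made up for illustration and are not part of SAX; they only mirror the relationship between ContentHandler and DocumentPrinter.

```java
// Hypothetical interface, standing in for something like ContentHandler.
interface EventSink {
    String onStart(String name);
    String onEnd(String name);
}

// The compiler rejects this class unless it defines every EventSink method.
class PrintingSink implements EventSink {
    @Override
    public String onStart(String name) {
        return "Start: " + name;
    }

    @Override
    public String onEnd(String name) {
        return "End: " + name;
    }
}

class InterfaceDemo {
    public static void main(String[] args) {
        // The caller only needs to know about the interface type.
        EventSink sink = new PrintingSink();
        System.out.println(sink.onStart("proj"));
        System.out.println(sink.onEnd("proj"));
    }
}
```

If you delete onEnd() from PrintingSink, the class no longer compiles, which is the same safety net the parser relies on.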
After the class has been declared, a single member variable is created for the class, PARSER_NAME. This variable is a constant that contains the name of the class that you're going to use as the SAX parser. As you learned earlier, there are any number of SAX parsers available. The Xerces parser just so happens to be one of the better Java SAX parsers out there, which explains the parser name of org.apache.xerces.parsers.SAXParser.
Although SAX is certainly a popular XML parsing API for Java given its relatively long history, it has some serious competition from Sun, the makers of Java. As of J2SE 5.0, Java includes an XML API called JAXP that provides built-in XML parsing for Java. To learn more about JAXP, visit http://java.sun.com/xml/jaxp/.
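For comparison, here is a minimal sketch of obtaining an XMLReader through JAXP instead of naming the Xerces class directly. It assumes a JDK that bundles the javax.xml.parsers package, and it leans on DefaultHandler (a SAX helper class with empty handler methods) to keep the code short. The class name JaxpReaderSketch and the inline sample document are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class JaxpReaderSketch {
    // Parses the XML text and returns the qualified names of its elements
    // in the order their start tags are reported.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            // JAXP picks a parser implementation; no parser class name is hard-coded.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler supplies empty versions of the ContentHandler methods.
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<proj><name>Woodmont Close</name></proj>"));
    }
}
```

The object you get back is still an XMLReader, so the handler code in the rest of this lesson would work the same way with either approach.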
The main() Method
Every command-line Java application begins its life with the main() method. In the Java world, the main method indicates that a class is a standalone program, as opposed to one that just provides functionality used by other classes. Perhaps more importantly, it's the method that gets run when you start the program. The purpose of this method is to set up the parser and get the name of the document to be parsed from the arguments passed in to the program. Here's the code:
public static void main(String[] args) {
    if (args.length == 0) {
        System.out.println("No XML document path specified.");
        System.exit(1);
    }
    DocumentPrinter dp = new DocumentPrinter();
    XMLReader parser;
    try {
        parser = (XMLReader)Class.forName(PARSER_NAME).newInstance();
        parser.setContentHandler(dp);
        parser.setErrorHandler(dp);
        parser.parse(args[0]);
    }
    // Normally it's a bad idea to catch generic exceptions like this.
    catch (Exception ex) {
        System.out.println(ex.getMessage());
        ex.printStackTrace();
    }
}
This program expects that the user will specify the path to an XML document as its only command-line argument. If no such argument is supplied, the program reports that no XML document path was specified and then exits.
Next, the program creates an instance of the DocumentPrinter object and assigns it to the variable dp. You'll need this object later when you tell the parser which ContentHandler and ErrorHandler to use. After instantiating dp, a try...catch block is opened to house the parsing code. This is necessary because some of the methods called to carry out the parsing can throw exceptions that must be caught within the program. All of the real work in the program takes place inside the try block.
The try...catch block is the standard way in which Java handles errors that crop up during the execution of a program. It enables the program to compensate and work around those errors if the programmer chooses to do so. In this case, you simply print out information about the error and allow the program to exit gracefully.
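The sample program catches the generic Exception type to keep things short. As a rough sketch of the alternative, the code below catches the narrower exception types one at a time. It assumes the parser bundled with the JDK (reached through JAXP) rather than Xerces so that it runs without extra archives, and the class name SpecificCatchSketch is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SpecificCatchSketch {
    // Returns "ok" if the text parses, otherwise a short description of the problem.
    static String tryParse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException ex) {
            // Most specific first: the document itself is not well formed.
            return "parse error at line " + ex.getLineNumber();
        } catch (SAXException | ParserConfigurationException ex) {
            return "parser problem: " + ex.getMessage();
        } catch (IOException ex) {
            return "I/O problem: " + ex.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("<proj></proj>"));
        System.out.println(tryParse("<proj><name></proj>"));
    }
}
```

Ordering the catch clauses from most specific to most general lets each kind of failure get its own message.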
Within the try...catch block, the first order of business is creating a parser object. This object is actually an instance of the class named in the variable PARSER_NAME. The fact that you're using it through the XMLReader interface means that you can call only those methods included in that interface. For this application, that's fine. The class specified in the PARSER_NAME variable is loaded and instantiated, and the resulting object is assigned to the variable parser. Because SAX 2.0 parsers must implement XMLReader, you can refer to the parser as an object of that type rather than referring to the class by its own name, SAXParser.
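The load-by-name pattern is easier to see in isolation. The sketch below applies it to a class that ships with Java (java.util.ArrayList, used through the List interface) so that it runs without Xerces on the classpath. The names LoadByNameSketch and LIST_NAME are hypothetical. It also uses getDeclaredConstructor().newInstance(), because calling newInstance() directly on a Class object has been deprecated since Java 9.

```java
import java.util.List;

class LoadByNameSketch {
    // Stand-in for PARSER_NAME; any class with a public no-argument constructor works.
    private static final String LIST_NAME = "java.util.ArrayList";

    // Loads the named class, creates an instance, and exposes it only as a List.
    @SuppressWarnings("unchecked")
    static List<String> newList() {
        try {
            Object instance = Class.forName(LIST_NAME).getDeclaredConstructor().newInstance();
            return (List<String>) instance;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Could not create " + LIST_NAME, ex);
        }
    }

    public static void main(String[] args) {
        List<String> list = newList();
        list.add("condos.xml");
        System.out.println(list.size());
    }
}
```

Because the calling code only ever sees the interface type, swapping in a different implementation is a matter of changing the string constant.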
After the parser has been created, you can start setting its properties. Before actually parsing the document, however, you have to specify the content and error handlers that the parser will use. Because the DocumentPrinter class can play both of those roles, you simply set both of those properties to dp (the DocumentPrinter object you just created). At this point, all you have to do is call the parse() method on the URI passed in on the command line, which is exactly what the code does.
Implementing the ContentHandler Interface
The skeleton for the program is now in place. The rest of the program consists of methods that fulfill the requirements of the ContentHandler and ErrorHandler interfaces. More specifically, these methods respond to events that are triggered during the parsing of an XML document. In this program, the methods just print out the content that they receive.
The first of these methods is the characters() method, which is called whenever content is parsed in a document. Following is the code for this method:
public void characters(char[] ch, int start, int length) {
    String chars = "";
    for (int i = start; i < start + length; i++)
        chars = chars + ch[i];
    if ((chars.trim()).length() > 0)
        System.out.println("Received characters: " + chars);
}
The characters() method receives content found within elements. It accepts three arguments: an array of characters, the position in the array where the content starts, and the amount of content received. In this method, a for loop is used to extract the content from the array, starting at the position in the array where the content starts, and iterating over each element until the position of the last element is reached. When all of the characters are gathered, the code checks to make sure they aren't just empty spaces, and then prints the results if not.
It's important not to just process all of the characters in the array of characters passed in unless that truly is your intent. The array can contain lots of padding on both sides of the relevant content, and including it all will result in a lot of extra characters along with the content that you actually want. On the other hand, if you know that the document contains parsed character data (PCDATA) that you want to read verbatim, then by all means process all of the characters.
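As a sketch of the same idea without the character-by-character loop, the handler below copies only the slice between start and start + length into a StringBuilder. It assumes the JDK's bundled parser reached through JAXP and extends DefaultHandler for brevity; the class name CharactersSketch and the helper textOf() are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CharactersSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy only the slice [start, start + length); ignore the rest of the array.
        text.append(ch, start, length);
    }

    // Returns all character data in the document, trimmed of surrounding whitespace.
    static String textOf(String xml) {
        CharactersSketch handler = new CharactersSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return handler.text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(textOf("<name>  Woodmont Close </name>"));
    }
}
```

Appending to a buffer also copes with a parser that delivers one run of text in several characters() calls, which SAX permits.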
The next two methods, startDocument() and endDocument(), are called when the beginning and end of the document are encountered, respectively. They accept no arguments and are called only once each during document parsing, for obvious reasons. Here's the code for these methods:
public void startDocument() {
    System.out.println("Start document.");
}
public void endDocument() {
    System.out.println("End of document reached.");
}
Next let's look at the startElement() and endElement() methods, which accept the most complex set of arguments of any of the methods that make up a ContentHandler:
public void startElement(String namespaceURI, String localName,
        String qName, Attributes atts) {
    System.out.println("Start element: " + localName);
}
public void endElement(String namespaceURI, String localName,
        String qName) {
    System.out.println("End of element: " + localName);
}
The startElement() method accepts four arguments from the parser. The first is the namespace URI, which you'll see elsewhere as well. The namespace URI is the URI for the namespace associated with the element. If a namespace is used in the document, the URI for the namespace is provided in a namespace declaration. The local name is the name of the element without the namespace prefix. The qualified name is the name of the element including the namespace prefix if there is one. Finally, the attributes are provided as an instance of the Attributes object. The endElement() method accepts the same first three arguments but not the final attributes argument.
SAX parsers must have namespace processing turned on in order to populate all of these arguments. If that option is deactivated, any of the arguments (other than the attributes) may be populated with empty strings. The method for turning on namespace processing varies depending on which parser you use.
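As one example of turning the option on, the sketch below uses JAXP's SAXParserFactory.setNamespaceAware(true). That choice is an assumption made so the code runs on a plain JDK; other parsers may expose the setting differently. The class name NamespaceSketch, the helper describeRoot(), and the urn:example:condos namespace are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceSketch {
    // Returns "namespaceURI|localName|qName" for the root element of the document.
    static String describeRoot(String xml) {
        String[] result = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this call the factory creates parsers that ignore namespaces.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            // Record only the first element reported, which is the root.
                            if (result[0] == null) {
                                result[0] = uri + "|" + localName + "|" + qName;
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(describeRoot("<c:proj xmlns:c=\"urn:example:condos\"/>"));
    }
}
```

Running it against a prefixed element is a quick way to see which of the three name arguments your parser fills in.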
Let's look at attribute processing specifically. Attributes are supplied to the startElement() method as an instance of the Attributes object. In the sample code, you use three methods of the Attributes object: getLength(), getLocalName(), and getValue(). The getLength() method is used to iterate over the attributes supplied to the method call, while getLocalName() and getValue() accept the index of the attribute being retrieved as arguments. The code retrieves each attribute and prints out its name and value. In case you're curious, the full list of methods for the Attributes object appears in Table 17.1.
Table 17.1. Methods of the Attributes Object
Method | Purpose |
---|---|
getIndex(String qName) | Retrieves an attribute's index using its qualified name |
getIndex(String uri, String localName) | Retrieves an attribute's index using its namespace URI and the local portion of its name |
getLength() | Returns the number of attributes in the element |
getLocalName(int index) | Returns the local name of the attribute associated with the index |
getQName(int index) | Returns the qualified name of the attribute associated with the index |
getType(int index) | Returns the type of the attribute with the supplied index |
getType(String uri, String localName) | Looks up the type of the attribute with the namespace URI and name specified |
getURI(int index) | Looks up the namespace URI of the attribute with the index specified |
getValue(int index) | Looks up the value of the attribute using the index |
getValue(String qName) | Looks up the value of the attribute using the qualified name |
getValue(String uri, String localName) | Looks up the value of the attribute using the namespace URI and local name |
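The attribute loop described before the table could be sketched as follows. This is an illustration rather than the Document Printer source: it gathers the attributes into a map instead of printing them, assumes the JDK's bundled parser with namespace processing switched on through JAXP, and uses the made-up names AttributeSketch and allAttributes().

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AttributeSketch {
    // Collects every attribute in the document into a map sorted by name.
    // If two elements share an attribute name, the later value wins.
    static Map<String, String> allAttributes(String xml) {
        Map<String, String> found = new TreeMap<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace processing is needed for getLocalName() to be populated.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            // getLength() bounds the loop; the other two take an index.
                            for (int i = 0; i < atts.getLength(); i++) {
                                found.put(atts.getLocalName(i), atts.getValue(i));
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<proj status=\"active\">"
                + "<location lat=\"36.122238\" long=\"-86.845028\"/></proj>";
        System.out.println(allAttributes(xml));
    }
}
```

A sorted map is used because SAX makes no promise about the order in which attributes are listed.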
Getting back to the endElement() method, its operation is basically the same as that of startElement() except that it doesn't accept the attributes of the element as an argument.
The next two methods, startPrefixMapping() and endPrefixMapping(), have to do with prefix mappings for namespaces:
public void startPrefixMapping(String prefix, String uri) {
    System.out.println("Prefix mapping: " + prefix);
    System.out.println("URI: " + uri);
}
public void endPrefixMapping(String prefix) {
    System.out.println("End of prefix mapping: " + prefix);
}
These methods are used to report the beginning and end of namespace prefix mappings when they are encountered in a document.
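To see these two events fire, you need a document that actually declares a namespace prefix. The sketch below records the events for such a document. It assumes the JDK's bundled parser with namespace processing enabled through JAXP, and the names PrefixMappingSketch, mappings(), and urn:example:condos are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrefixMappingSketch extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add("start " + prefix + " -> " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add("end " + prefix);
    }

    // Returns the prefix-mapping events reported while parsing the document.
    static List<String> mappings(String xml) {
        PrefixMappingSketch handler = new PrefixMappingSketch();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Prefix mappings are only reported when namespace processing is on.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(mappings("<c:proj xmlns:c=\"urn:example:condos\"/>"));
    }
}
```

The condos.xml document declares no namespaces, which is why no prefix-mapping lines show up in the program output later in this lesson.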
The next method, ignorableWhitespace(), is similar to characters(), except that it reports whitespace in element content that can be ignored.
public void ignorableWhitespace(char[] ch, int start, int length) {
    System.out.println("Received whitespace.");
}
Next on the method agenda is processingInstruction(), which reports processing instructions to the content handler. For example, a stylesheet can be associated with an XML document using the following processing instruction:
<?xml-stylesheet href="mystyle.css" type="text/css"?>
The method that handles such instructions is:
public void processingInstruction(String target, String data) {
    System.out.println("Received processing instruction:");
    System.out.println("Target: " + target);
    System.out.println("Data: " + data);
}
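To make the split between target and data concrete, here is a small sketch that feeds the stylesheet instruction shown earlier to a handler and returns what it receives. It assumes the JDK's bundled parser reached through JAXP; the names ProcessingInstructionSketch and firstInstruction() and the doc root element are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ProcessingInstructionSketch {
    // Returns "target|data" for the first processing instruction, or null if none.
    static String firstInstruction(String xml) {
        String[] result = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void processingInstruction(String target, String data) {
                            if (result[0] == null) {
                                result[0] = target + "|" + data;
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet href=\"mystyle.css\" type=\"text/css\"?><doc/>";
        System.out.println(firstInstruction(xml));
    }
}
```

Notice that the parser hands over everything after the target as one string; pulling href and type out of it is left to your code.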
The last method you need to be concerned with is setDocumentLocator(), which the parser calls once, before it reports any other event. Nothing is output by this method in this program, but I'll explain what its purpose is anyway. The parser passes setDocumentLocator() a Locator object, and a handler that stores that object can consult it during any later event. The Locator object contains information about where in the document the event currently being processed is located. Here's the "do nothing" source code for the method:
public void setDocumentLocator(Locator locator) { }
The methods of a Locator object are described in Table 17.2.
Table 17.2. The Methods of a Locator Object
Method | Purpose |
---|---|
getColumnNumber() | Returns the column number of the current position in the document being parsed |
getLineNumber() | Returns the line number of the current position in the document being parsed |
getPublicId() | Returns the public identifier of the current document event |
getSystemId() | Returns the system identifier of the current document event |
Because the sample program doesn't concern itself with the specifics of locators, none of these methods are actually used. However, it's good for you to know about them in case you need to develop a program that somehow is interested in locators.
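If you do end up needing a locator, the usual approach is to store the object in a member variable and query it from the other handler methods. The following sketch shows that idea by recording the line number reported at each start tag. It assumes the JDK's bundled parser reached through JAXP, and the names LocatorSketch and elementLines() are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch extends DefaultHandler {
    private Locator locator;
    private final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Keep the reference; the parser updates the object as parsing proceeds.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // A parser is not required to supply a locator, so guard against null.
        if (locator != null) {
            lines.put(qName, locator.getLineNumber());
        }
    }

    // Maps each element name to the line number reported at its start tag.
    static Map<String, Integer> elementLines(String xml) {
        LocatorSketch handler = new LocatorSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the document", ex);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        System.out.println(elementLines("<proj>\n  <name>Woodmont Close</name>\n</proj>"));
    }
}
```

The same stored locator could just as easily be used to add line numbers to the messages printed by Document Printer.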
Implementing the ErrorHandler Interface
I mentioned earlier that the DocumentPrinter class implements two interfaces, ContentHandler and ErrorHandler. Let's look at the methods that are used to implement the ErrorHandler interface. There are three types of errors that a SAX parser can generate: errors, fatal errors, and warnings. Classes that implement the ErrorHandler interface must provide methods to handle all three types of errors. Here's the source code for the three methods:
public void error(SAXParseException exception) { }
public void fatalError(SAXParseException exception) { }
public void warning(SAXParseException exception) { }
As you can see, each of the three methods accepts the same argument, a SAXParseException object. The only difference between them is that they are called under different circumstances. To keep things simple, the sample program doesn't output any error notifications. For the sake of completeness, the full list of methods supported by SAXParseException appears in Table 17.3.
Table 17.3. Methods of the SAXParseException Class
Method | Purpose |
---|---|
getColumnNumber() | Returns the column number of the current position in the document being parsed |
getLineNumber() | Returns the line number of the current position in the document being parsed |
getPublicId() | Returns the public identifier of the current document event |
getSystemId() | Returns the system identifier of the current document event |
Similar to the Locator methods, these methods aren't used in the Document Printer sample program, so you don't have to worry about the ins and outs of how they work.
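Should you want the error methods to do something, one possible shape is sketched below: each method records the severity along with the line and column taken from the SAXParseException. This is not part of the Document Printer program. It assumes the JDK's bundled parser reached through JAXP, and the names ReportingErrorHandler and check() are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ReportingErrorHandler implements ErrorHandler {
    final List<String> reports = new ArrayList<>();

    private void record(String level, SAXParseException ex) {
        reports.add(level + " at line " + ex.getLineNumber()
                + ", column " + ex.getColumnNumber() + ": " + ex.getMessage());
    }

    @Override
    public void warning(SAXParseException exception) {
        record("warning", exception);
    }

    @Override
    public void error(SAXParseException exception) {
        record("error", exception);
    }

    @Override
    public void fatalError(SAXParseException exception) {
        record("fatal error", exception);
    }

    // Parses the text and returns whatever the error handler was told about.
    static List<String> check(String xml) {
        ReportingErrorHandler handler = new ReportingErrorHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException ex) {
            // The parser may abandon the parse after a fatal error;
            // the handler has already recorded the problem by then.
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not run the parser", ex);
        }
        return handler.reports;
    }

    public static void main(String[] args) {
        for (String report : check("<proj><name></proj>")) {
            System.out.println(report);
        }
    }
}
```

A mismatched end tag like the one in main() is a well-formedness problem, so it arrives through fatalError() rather than error() or warning().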
Testing the Document Printer Program
Now that you understand how the code works in the Document Printer sample program, let's take it for a test drive one more time. This time around, you're running the program to parse the condos.xml sample document from the previous tutorial. Here's an excerpt from that document in case it's already gotten a bit fuzzy in your memory:
<proj status="active">
  <location lat="36.122238" long="-86.845028" />
  <description>
    <name>Woodmont Close</name>
    <address>131 Woodmont Blvd.</address>
    <address2>Nashville, TN 37205</address2>
    <img>condowc.jpg</img>
  </description>
</proj>
And here's the command required to run this document through the Document Printer program:
java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter condos.xml
Finally, Listing 17.2 contains the output of the Document Printer program after feeding it the condominium map data stored in the condos.xml document.
Listing 17.2. The Output of the Document Printer Example Program After Processing the condos.xml Document
1: Start document.
2: Start element: projects
3: Start element: proj
4: Start element: location
5: End of element: location
6: Start element: description
7: Start element: name
8: Received characters: Woodmont Close
9: End of element: name
10: Start element: address
11: Received characters: 131 Woodmont Blvd.
12: End of element: address
13: Start element: address2
14: Received characters: Nashville, TN 37205
15: End of element: address2
16: Start element: img
17: Received characters: condowc.jpg
18: End of element: img
19: End of element: description
20: End of element: proj
21: ...
22: Start element: proj
23: Start element: location
24: End of element: location
25: Start element: description
26: Start element: name
27: Received characters: Harding Hall
28: End of element: name
29: Start element: address
30: Received characters: 2120 Harding Pl.
31: End of element: address
32: Start element: address2
33: Received characters: Nashville, TN 37215
34: End of element: address2
35: Start element: img
36: Received characters: condohh.jpg
37: End of element: img
38: End of element: description
39: End of element: proj
40: End of element: projects
41: End of document reached.
The excerpt from the condos.xml document that you saw a moment ago corresponds to the first proj element in the XML document. Lines 3 through 20 show how the Document Printer program parses and displays detailed information for this element and all of its content.