XML

A Quick XML Primer

You learned in the previous tutorial that XML is a markup language used to create other markup languages. Because HTML is a markup language, it stands to reason that XML documents should in some way resemble HTML documents. In fact, you saw in the previous tutorial how an XML document looks a lot like an HTML document, with the obvious difference that XML documents can use custom tags. So, instead of seeing <head> and <body> you saw <pet> and <friend>. Nonetheless, if you have some experience with coding web pages in HTML, XML will be very familiar. You will find that XML isn't nearly as lenient as HTML, so you may have to unlearn some bad habits carried over from HTML.

Of course, if you don't have any experience with HTML you probably won't even realize that XML is a somewhat rigid language.

XML Building Blocks

Like some other markup languages, XML relies heavily on three fundamental building blocks: elements, attributes, and values. An element is used to describe or contain a piece of information; elements form the basis of all XML documents. Elements consist of two tags: an opening tag and a closing tag. Opening tags appear as words contained within angle brackets (<>), such as <pet> or <friend>. Closing tags also appear within angle brackets, but they have a forward-slash (/) just before the tag name. Examples of closing tags are </pet> and </friend>. Elements always appear as an opening tag, followed by optional data, followed by a closing tag:

<pet>
</pet>

In this example, there is no data appearing between the opening and closing tags, which illustrates that the data is indeed optional. XML doesn't care too much about how whitespace appears between tags, so it's perfectly acceptable to place tags together on the same line:

<pet></pet>

Keep in mind that the purpose of tags is to denote pieces of information in an XML document, so it is rare to see a pair of tags with nothing between them, as the previous two examples show. Instead, tags typically contain text content or additional tags. Following is an example of how the pet element can contain additional content, which in this case is a couple of friend elements:

<pet>
  <friend />
  <friend />
</pet>

It's important to note that an element is a logical unit of information in an XML document, whereas a tag is a specific piece of XML code that comprises an element. That's why I always refer to an element by its name, such as pet, whereas tags are always referenced just as they appear in code, such as <pet> or </pet>.

You're probably wondering why this code broke the rule requiring that every element has to consist of both an opening and a closing tag. In other words, why do the friend elements appear to involve only a single tag? The answer to this question is that XML allows you to abbreviate empty elements. An empty element is an element that doesn't contain any content within its opening and closing tags. The earlier pet examples you saw are empty elements. Because empty elements don't contain any content between their opening and closing tags, you can abbreviate them by using a single tag known as an empty tag. Similar to other tags, an empty tag is enclosed by angle brackets (<>), but it also includes a forward slash (/) just before the closing angle bracket. So, the empty friend element, which would normally be coded as <friend></friend> can be abbreviated as <friend />. The space before the /> isn't necessary but is a standard style practice among XML developers.
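One way to convince yourself that the abbreviated form really is the same element is to hand both spellings to a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which is not part of this tutorial; the describe helper is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET


def describe(markup):
    """Parse one element and return (tag name, text content, child count)."""
    element = ET.fromstring(markup)
    return (element.tag, element.text, len(element))


# The full tag pair, the empty tag with a space, and the empty tag without one.
long_form = describe("<friend></friend>")
short_form = describe("<friend />")
no_space = describe("<friend/>")

# All three spellings describe the same empty element.
print(long_form == short_form == no_space)
```

The parser reports the same name, no text, and no children in every case, so the choice between the forms is purely one of style.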

Any discussion of opening and closing tags wouldn't be complete without pointing out a habit that appears in many HTML documents. I'm referring to the <p> tag, which is used to enclose a paragraph of text, and is often found in HTML documents with an opening tag but no closing tag. HTML tolerates the omission, but the p element is not an empty element, so under XML rules it must always have a </p> closing tag. This kind of freewheeling coding will get you in trouble quickly with XML!

All this talk of empty elements brings to mind the question of why you'd want to use an element that has no content. The reason for this is because you can still attach attributes to empty elements. Attributes are small pieces of information that appear within an element's opening tag. An attribute consists of an attribute name and a corresponding attribute value, which are separated by an equal symbol (=). The value of an attribute appears to the right of the equal symbol and must appear within quotes. Following is an example of an attribute named name that is associated with the friend element:

<friend name="Augustus" />

Attributes represent another area where HTML code is often in error, at least from the perspective of XML. HTML attributes are regularly used without quotes, which is a clear violation of XML syntax. Always quoting attribute values is another habit you'll need to learn if you're making the progression from free-spirited HTML designer to ruthlessly efficient XML coder.

In this example, the name attribute is used to identify the name of a friend. Attributes aren't limited to empty elements; they are just as useful with nonempty elements. Additionally, you can use several different attributes with a single element. Following is an example of how several attributes are used to describe a pet in detail:

<pet name="Maximillian" type="pot bellied pig" age="3">

As you can see, attributes are a great way to tie small pieces of descriptive information to an element without actually affecting the element's content.
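To see how an application might read those attributes, here is a small sketch that again assumes Python's standard xml.etree.ElementTree module. A closing </pet> tag has been added so that the snippet is a complete element; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The pet element from the text, closed so that it parses on its own.
pet = ET.fromstring(
    '<pet name="Maximillian" type="pot bellied pig" age="3"></pet>'
)

# Attributes arrive as name/value pairs attached to the element.
for attribute_name, attribute_value in pet.attrib.items():
    print(attribute_name, "=", attribute_value)

# Every attribute value is text, even one that looks like a number.
age = pet.get("age")

# The attributes did not add any content to the element itself.
has_content = pet.text is not None or len(pet) > 0
```

Note that the age comes back as the text "3" rather than a number; XML itself attaches no data type to an attribute value, so any conversion is up to the application.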

Inside an Element

A nonempty element is an element that contains content within its opening and closing tags. Earlier I mentioned that this content could be either text or additional elements. When elements are contained within other elements, they are known as nested elements. To understand how nested elements work, consider an apartment building. Individual apartments are contained within the building, whereas individual rooms are contained within each apartment. Within each room there may be pieces of furniture that in turn are used to store belongings. In XML terms, the belongings are nested in the furniture, which is nested in the rooms, which are nested in the apartments, which are nested in the apartment building. Listing 2.1 shows how the apartment building might be coded in XML.


Listing 2.1. An Apartment Building XML Example
 1: <apartmentbldg>
 2:   <apartment>
 3:     <room type="bedroom">
 4:       <furniture type="armoire">
 5:         <belonging type="t-shirt" color="navy" size="xl" />
 6:         <belonging type="sock" color="white" />
 7:         <belonging type="watch" />
 8:       </furniture>
 9:     </room>
10:   </apartment>
11: </apartmentbldg>

If you study the code, you'll notice that the different elements are nested according to their physical locations within the building. It's important to recognize in this example that the belonging elements are empty elements (lines 5 through 7), which is evident from the fact that they use the abbreviated empty tag ending in />. These elements are empty because they aren't required to house (no pun intended!) any additional information. In other words, it is sufficient to describe the belonging elements solely through attributes.
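The nesting in Listing 2.1 can also be explored programmatically. The sketch below assumes Python's standard xml.etree.ElementTree module and a hypothetical walk helper; it copies the listing without its line numbers, prints an indented outline of the element names, and collects the type of every belonging.

```python
import xml.etree.ElementTree as ET

LISTING = """<apartmentbldg>
  <apartment>
    <room type="bedroom">
      <furniture type="armoire">
        <belonging type="t-shirt" color="navy" size="xl" />
        <belonging type="sock" color="white" />
        <belonging type="watch" />
      </furniture>
    </room>
  </apartment>
</apartmentbldg>"""


def walk(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(walk(child, depth + 1))
    return lines


building = ET.fromstring(LISTING)
outline = walk(building)
print("\n".join(outline))

# iter() visits matching elements at any depth below the starting element.
belongings = [item.get("type") for item in building.iter("belonging")]
print(belongings)
```

Each level of indentation in the outline corresponds to one level of nesting, which mirrors the building, apartment, room, furniture, and belonging hierarchy described above.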

It's important to realize that nonempty elements aren't just used for nesting purposes. Nonempty elements often contain text content, which appears between the opening and closing tags. Following is an example of how you might decide to expand the belonging element so that it isn't empty:

<furniture type="desk">
  <belonging type="letter">
    Dear Customer,
    I am pleased to announce that you may have won our sweepstakes. You are
    one of the lucky finalists in your area, and if you would just purchase
    five or more magazine subscriptions then you may eventually win some
    money. Or not.
  </belonging>
</furniture>

In this example, my ticket to an early retirement appears as text within the belonging element. You can include just about any text you want in an element, with the exception of a few special symbols, which you learn about a little later in the tutorial.
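Text content is just as easy to retrieve as attributes. This sketch, which assumes Python's standard xml.etree.ElementTree module and uses a shortened version of the letter, shows that a parser hands back the text exactly as typed, including the line breaks and indentation used to lay out the code.

```python
import xml.etree.ElementTree as ET

desk = ET.fromstring("""<furniture type="desk">
  <belonging type="letter">
    Dear Customer,
    I am pleased to announce that you may have won our sweepstakes.
  </belonging>
</furniture>""")

letter = desk.find("belonging")

# The raw text keeps every newline and space between the tags.
raw_text = letter.text

# Collapsing runs of whitespace is a choice the application makes, not XML.
tidy_text = " ".join(raw_text.split())
print(tidy_text)
```

Whether that surrounding whitespace matters is up to the application reading the document, which is why the sketch tidies it as a separate step.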

XML's Five Commandments

Now that you have a feel for the XML language, it's time to move on and learn about the specific rules that govern its usage. I've mentioned already that XML is a more rigid language than HTML, which basically means that you have to pay attention when coding XML documents. In reality, the exacting nature of the XML language is a huge benefit to XML developers: you'll quickly get in the habit of writing much cleaner code than you might have been accustomed to writing in HTML, which will result in more accurate and reliable code. The key to XML's accuracy lies in a few simple rules, which I'm calling XML's five commandments:

  1. Tag names are case sensitive.

  2. Every opening tag must have a corresponding closing tag (unless it is abbreviated as an empty tag).

  3. A nested tag pair cannot overlap another tag.

  4. Attribute values must appear within quotes.

  5. Every document must have a root element.

Admittedly, the last rule is one that I haven't prepared you for, but the others should make sense to you. First off, rule number one states that XML is a case-sensitive language, which means that <pet>, <Pet>, and <PET> are all different tags. If you're coming from the world of HTML, this is a very critical difference between XML and HTML. It's not uncommon to see HTML code that alternates back and forth between tags such as <b> and <B> for bold text. In XML, this mixing of case is a clear no-no. Generally speaking, XML standards encourage developers to use either lowercase tags or mixed case tags, as opposed to the uppercase tags commonly found in HTML web pages. The same rule applies to attributes. If you're writing XML code in a specific XML-based markup language, the language itself will dictate what case you should use for tags and attributes.
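Case sensitivity is easy to test for yourself. The following sketch assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper that simply reports whether the parser accepted the code, and the Gus nickname is invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# The closing tag must match the opening tag exactly, including case.
matching_case = is_well_formed("<pet></pet>")
mismatched_case = is_well_formed("<pet></Pet>")

# Attribute names are case sensitive too: name and Name are two attributes.
mixed = ET.fromstring('<pet name="Augustus" Name="Gus" />')
print(matching_case, mismatched_case, sorted(mixed.attrib))
```

Because <pet> and </Pet> are different tags as far as XML is concerned, the second snippet is treated as an opening tag that was never properly closed.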

The second rule reinforces what you've already learned by stating that every opening tag (<pet>) must have a corresponding closing tag (</pet>). In other words, tags must always appear in pairs (<pet></pet>). Of course, the exception to this rule is the empty tag, which serves as an abbreviated form of a tag pair (<pet />). Rule three continues to nail down the relationship between tags by stating that tag pairs cannot overlap each other. This really means that a tag pair must be completely nested within another tag pair. Perhaps an example will better clarify this point:

<pets>
  <pet name="Maximillian" type="pot bellied pig" age="3">
  </pet>
  <pet name="Augustus" type="goat" age="2">
</pets>
  </pet>

Do you see the problem with this code? The problem is that the second pet element isn't properly nested within the pets element. The code indentation helps to make the problem more apparent but this isn't always the case. For example, consider the following version of the same code:

<pets>
  <pet name="Maximillian" type="pot bellied pig" age="3">
  </pet>
  <pet name="Augustus" type="goat" age="2">
  </pets>
</pet>

Remembering that whitespace doesn't normally affect the structure of XML code, this listing is functionally no different than the previous listing, but the nesting problem is much more hidden. In other words, the second pet element is still split out across the pets element, which is wrong. This code is wrong because it is no longer clear whether the second pet element is intended to be nested within the pets element or not; the relationship between the elements is ambiguous. And XML despises ambiguity! To resolve the problem you must either move the closing </pet> tag so that it is enclosed within the pets element, or move the opening <pet> tag so that it is outside of the pets element.
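Since indentation can hide an overlap from the eye, it helps to let a parser find it. This sketch assumes Python's standard xml.etree.ElementTree module; nesting_error is a hypothetical helper that returns the parser's complaint, or None when the code is accepted. The repaired version applies the first fix described above by moving the closing </pet> tag inside the pets element.

```python
import xml.etree.ElementTree as ET

OVERLAPPING = """<pets>
  <pet name="Maximillian" type="pot bellied pig" age="3">
  </pet>
  <pet name="Augustus" type="goat" age="2">
  </pets>
</pet>"""

REPAIRED = """<pets>
  <pet name="Maximillian" type="pot bellied pig" age="3">
  </pet>
  <pet name="Augustus" type="goat" age="2">
  </pet>
</pets>"""


def nesting_error(markup):
    """Return the parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return str(error)
    return None


overlap_message = nesting_error(OVERLAPPING)
repaired_message = nesting_error(REPAIRED)
print(overlap_message)
print(repaired_message)
```

The exact wording of the error depends on the parser, but it typically points at the line where the wrong closing tag appears, which is a quick way to locate a nesting mistake in a long document.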

Getting back to the XML commandments, rule number four reiterates the earlier point regarding quoted attribute values. It simply means that all attribute values must appear in quotes. So, the following code breaks this rule because the name attribute value Maximillian doesn't appear in quotes:

<friend name=Maximillian />

As I mentioned earlier, if you have used HTML this is one rule in particular that you will need to remember as you begin working with XML. Most web page designers are very inconsistent in their usage of quotes with attribute values. In fact, the tendency is to not use them. XML requires quotes around all attribute values, no questions asked!
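The quoting rule can be checked the same way. In this sketch, which assumes Python's standard xml.etree.ElementTree module and a hypothetical read_name helper, the same attribute is written with double quotes, with apostrophes (which XML also accepts as quotes), and with no quotes at all.

```python
import xml.etree.ElementTree as ET


def read_name(markup):
    """Return the name attribute, or None if the markup is rejected."""
    try:
        return ET.fromstring(markup).get("name")
    except ET.ParseError:
        return None


double_quoted = read_name('<friend name="Maximillian" />')
single_quoted = read_name("<friend name='Maximillian' />")
unquoted = read_name("<friend name=Maximillian />")

print(double_quoted, single_quoted, unquoted)
```

Either style of quote works as long as the opening and closing quotes match, but leaving them out causes the whole element to be rejected.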

The last XML commandment is the only one that I haven't really prepared you for because it deals with an entirely new concept: the root element. The root element is the single element in an XML document that contains all other elements in the document. Every XML document must contain a root element, which means that exactly one element must be at the top level of any given XML document. In the "pets" example that you've seen throughout this tutorial and the previous tutorial, the pets element is the root element because it contains all the other elements in the document (the pet and friend elements). To make a quick comparison to HTML, the html element in a web page is the root element, so HTML adheres to XML rules in this regard. However, technically HTML will let you get away with having more than one root element, whereas XML will not.
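The root element rule is enforced just as strictly as the others. The sketch below assumes Python's standard xml.etree.ElementTree module and a hypothetical has_single_root helper; it tries a document with one top-level element, one with two, and one with no elements at all.

```python
import xml.etree.ElementTree as ET


def has_single_root(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# One top-level element that contains everything else.
one_root = has_single_root('<pets><pet name="Maximillian" /></pets>')

# Two elements side by side at the top level.
two_roots = has_single_root('<pet name="Maximillian" /><pet name="Augustus" />')

# Plain text with no element around it.
no_root = has_single_root("Maximillian")

print(one_root, two_roots, no_root)
```

Only the first document is accepted; wrapping the two pet elements in a single pets element is all it takes to repair the second one.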

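Although the root element normally contains other elements, XML doesn't strictly require it to. A document whose root element is empty, or even abbreviated as an empty tag such as <pets />, is still valid XML code; it just doesn't carry much information. In practice, the root element almost always consists of a pair of opening and closing tags with other elements nested inside.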

Special XML Symbols

There are a few special symbols in XML that must be entered differently than other text characters when appearing as content within an XML document. The reason is that they are part of XML syntax itself: they mark off parts of an XML document such as tags, attribute values, and entities. The symbols to which I'm referring are the less than symbol (<), greater than symbol (>), quote symbol ("), apostrophe symbol ('), and ampersand symbol (&). Because these symbols all have special meaning within the XML language, you enter them using an entity instead of typing each character directly. Strictly speaking, only < and & must always be replaced; the other three need replacing only where they would collide with the surrounding syntax, such as an apostrophe inside an attribute value that is itself delimited by apostrophes. Replacing all five consistently is the safest habit. As an example, the following code uses the apostrophe (') character directly in the name attribute value, which a parser accepts only because the value happens to be delimited by double quotes:

<movie name="All the King's Men" />

The trick to referencing these characters is to use special predefined symbols known as entities. An entity is a symbol that identifies a resource, such as a text character or even a file. There are five predefined entities in XML that correspond to each of the special characters you just learned about. Entities in XML begin with an ampersand (&) and end with a semicolon (;), with the entity name sandwiched between. Following are the predefined entities for the special characters:

  • Less than symbol (<): &lt;

  • Greater than symbol (>): &gt;

  • Quote symbol ("): &quot;

  • Apostrophe symbol ('): &apos;

  • Ampersand symbol (&): &amp;

To escape the apostrophe in the movie example code, just replace it in the attribute value with the appropriate entity:

<movie name="All the King&apos;s Men" />

Here's another movie example, just to clarify how another of the entities is used:

<movie name="Pride &amp; Prejudice" />

In this example, the &amp; entity is used to help code the movie title, "Pride & Prejudice." Admittedly, these entities make the attribute values a little tougher to read, but there is no question regarding the usage of the characters. This is a good example of how XML is willing to make a trade-off between ease of use on the developer's part and technical clarity. Fortunately, there are only five predefined entities to deal with, so it's pretty easy to remember them.
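The readability cost of entities falls only on the person reading the raw code; an application sees the original characters. The following sketch assumes Python's standard xml.etree.ElementTree module for parsing and the escape function from the standard xml.sax.saxutils module for producing entities. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing turns each entity back into the character it stands for.
kings = ET.fromstring('<movie name="All the King&apos;s Men" />')
pride = ET.fromstring('<movie name="Pride &amp; Prejudice" />')
print(kings.get("name"))
print(pride.get("name"))

# Going the other way: escape() replaces &, <, and > by default, and
# accepts extra replacements such as the apostrophe as a dictionary.
escaped_title = escape("Pride & Prejudice")
escaped_apostrophe = escape("All the King's Men", {"'": "&apos;"})

# A bare ampersand in an attribute value is rejected outright.
try:
    ET.fromstring('<movie name="Pride & Prejudice" />')
    raw_ampersand_accepted = True
except ET.ParseError:
    raw_ampersand_accepted = False
```

Letting a library function do the escaping when you generate XML from a program is usually less error prone than typing the entities by hand.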

The XML Declaration

One final important topic to cover in this quick tour of XML is the XML declaration, which is not strictly required of all XML documents but is a good idea nonetheless. The XML declaration is a line of code at the beginning of an XML document that identifies the version of XML used by the document. Currently there are two versions of XML: 1.0 and 1.1. XML 1.1 primarily differs from XML 1.0 in how it supports characters in element and attribute names. XML 1.1's broader character support is primarily of use for mainframe programmers, which likely explains why XML 1.1 isn't very widely supported. Given this scenario, you should only worry about supporting XML 1.0 in your documents, at least for the foreseeable future.

There have been some rumblings in the XML community about a possible XML 2.0 but nothing concrete has materialized as of yet.

Getting back to the XML declaration, it notifies an application or web browser of the XML version that an XML document is using, which can be very helpful in processing the document. Following is the standard XML declaration for XML 1.0:

<?xml version="1.0"?>

This code looks somewhat similar to an opening tag for an element named xml with an attribute named version. However, the code isn't actually a tag at all. Instead, this code uses the syntax of a processing instruction, which is a special line of XML code that passes information to the application that is processing the document. (The XML specification technically classifies the XML declaration as its own construct rather than as a processing instruction, but the two look alike and serve a similar purpose.) In this case, the declaration is notifying the application that the document uses XML 1.0. Processing instructions are easily identified by the <? and ?> symbols that sandwich each instruction.
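To illustrate how an application might pick up the version from the declaration, here is a minimal sketch that assumes Python's standard xml.parsers.expat module, which calls a handler whenever it encounters an XML declaration. The declared_version function is a hypothetical helper written for this example.

```python
import xml.parsers.expat


def declared_version(document):
    """Return the version named in the XML declaration, or None if absent."""
    found = {"version": None}

    def on_declaration(version, encoding, standalone):
        # Called by the parser only when the document has a declaration.
        found["version"] = version

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return found["version"]


with_declaration = declared_version('<?xml version="1.0"?><pets />')
without_declaration = declared_version("<pets />")
print(with_declaration, without_declaration)
```

The second document still parses without complaint, which reflects the earlier point that the declaration is recommended rather than strictly required for XML 1.0.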

XML 1.1 documents are required to include an XML declaration. A document that has no declaration is treated as XML 1.0.