XML

Characters of Text in XML

As you know, XML documents are made of text. More specifically, an XML document consists of characters of text that have meaning based upon XML syntax. The characters of text within an XML document can be encoded in a number of different ways. XML defines its characters in terms of the Unicode text standard, which specifies the set of characters available for use in text documents; a character-encoding scheme determines how those characters are stored as bytes. The character-encoding scheme for an XML document is identified within the XML declaration in a piece of code known as the character encoding declaration. The character encoding declaration looks like an attribute of the XML declaration, as the following code shows:

<?xml version="1.0" encoding="UTF-8"?>

The UTF-8 value assigned in the character encoding declaration specifies that the document uses the Unicode UTF-8 character-encoding scheme, which is the default scheme for XML. All XML processors are required to support the UTF-8 and UTF-16 character encoding schemes. Both schemes can represent every Unicode character, so either one works for documents in any human language or mix of languages; the difference between them is how many bytes each character occupies. UTF-8 uses one byte for ASCII characters and two to four bytes for everything else, whereas UTF-16 uses two bytes for most characters and four bytes for the rest. UTF-8 is usually the more compact choice for text that is mostly English or markup, while UTF-16 can be more compact for text written mostly in East Asian scripts.
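The storage cost of each scheme can be checked directly. The following is a minimal sketch in Python (the choice of Python, the sample characters, and the encoded_sizes helper are assumptions made for illustration; none of them come from XML itself). It reports how many bytes one character occupies in UTF-8 and in UTF-16:

```python
# Sample characters, written as escapes so the source stays plain ASCII.
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00a9",      # U+00A9, copyright sign
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so the 2-byte byte-order mark is not counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

Neither scheme uses a fixed number of bits per character: the byte count depends on the character being stored.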

There are other character encoding standards in addition to UTF-8 and UTF-16, such as ISO-8859-1, which covers Western European languages. Support for these encodings is optional for XML processors, and each one covers only a subset of the Unicode character set, so UTF-8 or UTF-16 is generally the safer choice unless you have to work with existing documents or tools that rely on another encoding.
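To see how the declared encoding relates to the bytes of a document, here is a small Python sketch that uses the standard library's xml.etree.ElementTree parser. The word element, the sample text, and the build_document helper are hypothetical names chosen for this example. The same text is stored in two encodings, and the parser recovers identical characters from both:

```python
import xml.etree.ElementTree as ET

word = "caf\u00e9"  # "cafe" with an acute accent on the e; an arbitrary sample


def build_document(text, encoding):
    """Return an XML document as bytes, both declared and encoded as `encoding`."""
    xml = f'<?xml version="1.0" encoding="{encoding}"?><word>{text}</word>'
    return xml.encode(encoding)


latin1_doc = build_document(word, "ISO-8859-1")
utf8_doc = build_document(word, "UTF-8")

# The accented letter is one byte in ISO-8859-1 and two bytes in UTF-8.
print(word.encode("ISO-8859-1"))
print(word.encode("UTF-8"))

# After parsing, both documents contain the same characters.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
print(latin1_text == utf8_text)
```

The declaration has to match the bytes actually written; the sketch keeps the two in step by using one encoding argument for both.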

Regardless of the scheme you use to encode characters within an XML document, you need to know how to specify characters numerically. Every character has a numerical value associated with it (its Unicode code point) that can be used as a character reference, and this value is the same no matter which encoding scheme the document uses. Character references come in very handy when you're trying to enter a character that can't easily be typed on a keyboard, or one that the document's encoding can't represent directly. For example, the copyright symbol (©) has no key on most keyboards, so a character reference is a convenient way to specify it. There are two types of numeric character references:

  • Decimal reference (base 10)

  • Hexadecimal reference (base 16)

A decimal character reference relies on a decimal number (base 10) to specify a character's numeric value. Decimal references are specified using an ampersand followed by a pound sign (&#), the character number, and a semicolon (;). So, a complete decimal character reference has the following form:

&#Num;

The decimal number in this form is represented by Num. Following is an example of a decimal character reference:

&#169;

This character reference identifies the character associated with the decimal number 169, which just so happens to be the copyright symbol. Following is the copyright symbol character reference used within the context of other character data:

&#169;2005 Test Name

Even though the code looks a little messy with the character reference, you're using a symbol (the copyright symbol) that would otherwise be difficult to enter in a normal text editor since there is no copyright key on your keyboard.
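As a quick check of how a parser treats this reference, the following Python sketch wraps the text in a hypothetical notice element (the element name is an assumption; any element would do) and parses it with the standard library's xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# The document is given as bytes so the encoding declaration is honored as written.
document = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<notice>&#169;2005 Test Name</notice>"
)

notice_text = ET.fromstring(document).text

# The parser has already replaced the reference with the character itself.
print(ascii(notice_text))
print(f"first character is U+{ord(notice_text[0]):04X}")
```

The parsed text begins with the copyright character (code point 169, or U+00A9), not with the six characters of the reference.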

The decimal number associated with the copyright symbol is its Unicode code point, which is the same in both XML and HTML. To learn more about special characters that can be encoded using character references, please refer to the following web page: http://www.w3.org/TR/REC-html40/sgml/entities.html.

Table 4.1 lists some common character references you may find useful when developing XML documents of your own:

Table 4.1. Common Character References

Symbol                Character Reference

¾ (three-fourths)     &#190;

¼ (one-fourth)        &#188;

½ (one-half)          &#189;


There are many more character references that you can use to code obscure or otherwise difficult-to-enter characters. This list should give you a good start on using some of the more popular character references.
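If you need the reference for a character that isn't listed, you can derive it from the character's code point. A minimal Python sketch follows; the decimal_reference helper is a hypothetical name introduced for this example:

```python
def decimal_reference(ch):
    """Build the decimal character reference for a single character."""
    return f"&#{ord(ch)};"


# The fraction characters from Table 4.1, written as escapes.
fractions = {
    "three-fourths": "\u00be",
    "one-fourth": "\u00bc",
    "one-half": "\u00bd",
}

for name, ch in fractions.items():
    print(f"{name}: {decimal_reference(ch)}")
```

The same helper produces &#169; for the copyright symbol used earlier.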

Thus far you've focused on the first approach to specifying characters numerically in XML, which involves using decimal character references. If you're coming from a programming background, you may opt for the second approach to specifying numeric characters: hexadecimal references. A hexadecimal reference uses a hexadecimal number (base 16) to specify a character's numeric value. Hexadecimal references are specified similarly to decimal references, except that an x immediately precedes the number:

&#xNum;

Using this form, the copyright character with the decimal value of 169 is referenced in hexadecimal as the following:

&#xA9;

Because decimal and hexadecimal references are equivalent ways of referencing the same character, there is no technical reason to choose one over the other. The choice ultimately comes down to your degree of comfort with each number system, and most of us are more comfortable with decimal numbers because decimal is the number system used in everyday life.
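The equivalence is easy to confirm with a parser. In this Python sketch, the c element and the two helper functions are hypothetical names used only for illustration:

```python
import xml.etree.ElementTree as ET


def parse_text(fragment):
    """Parse a one-element XML fragment and return its text content."""
    return ET.fromstring(fragment).text


def hex_reference(ch):
    """Build the hexadecimal character reference for a single character."""
    return f"&#x{ord(ch):X};"


decimal_form = parse_text("<c>&#169;</c>")
hex_form = parse_text("<c>&#xA9;</c>")

# Both references resolve to the same single character.
print(decimal_form == hex_form)
print(hex_reference(decimal_form))
```

Both fragments yield the copyright character, and hex_reference turns that character back into &#xA9;.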

Quick Hexadecimal Primer

If you aren't naturally a binary thinker (few of us are), you might find hexadecimal numbers to be somewhat confusing. Hexadecimal numbers are strange looking because they use the letters A through F to represent the numbers 10 through 15. As an example, the decimal number 60 is 3C in hexadecimal; the C represents decimal 12, whereas the 3 represents decimal 48 (3 x 16); 12 plus 48 is 60. Many programming languages denote hexadecimal numbers by preceding them with 0x, which was no doubt an influence on the XML representation of hexadecimal character references.
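The arithmetic in this primer can be written out as a short Python sketch. The hex_to_decimal function is a hypothetical helper that spells out the digit-by-digit calculation; Python's built-in int and format do the same job:

```python
HEX_DIGITS = "0123456789ABCDEF"


def hex_to_decimal(hex_digits):
    """Convert a hexadecimal string to an integer, one digit at a time."""
    total = 0
    for digit in hex_digits.upper():
        # Shift the running total one hex place, then add the new digit's value.
        total = total * 16 + HEX_DIGITS.index(digit)
    return total


print(hex_to_decimal("3C"))  # 3 x 16 + 12
print(int("3C", 16))         # built-in equivalent
print(format(169, "X"))      # decimal 169 written in hexadecimal
```

The first two lines both print 60, and the last prints A9, the number used in the hexadecimal copyright reference.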