The Wonderful World of Entities

Although most entities have names, a few notable ones do not. The document entity, which is the top-level entity for a document, does not have a name. The document entity is important because it serves as a storage container for the entire document. This entity is then broken down into subentities, which are often broken down further. The breakdown of entities ends when you arrive at nothing but pure content. The other entity that goes unnamed is the external DTD (Document Type Definition), if one exists. You learned in the previous tutorial that a DTD is used to describe the format and structure of documents created in a specific XML-based language. For example, the DTD for HTML would specify exactly how the <img> tag is used to mark up images.

Getting back to entities, the TallTales.dtd external DTD you saw in the previous tutorial is an entity, as is the talltales.xml document itself, whose root element is talltales. Following is an excerpt from the talltales.xml document that illustrates the relationship between the external DTD and the root document element:

<?xml version="1.0"?>
<!DOCTYPE talltales SYSTEM "TallTales.dtd">
<talltales>
  <!-- Document markup -->
</talltales>

In this code, the talltales root element and the TallTales.dtd external DTD are referenced in the document type declaration. To clarify how entities are storage constructs, consider the fact that the contents of the external DTD could be directly inserted in the document type declaration, in which case the DTD would no longer be considered an entity. What makes the DTD an entity is the fact that its storage is external. A good analogy to this concept is a JPEG image on an HTML web page. The image itself is stored externally in a JPEG file, and then referenced from the web page via an <img> tag. Because the storage of the image is external to the HTML document, the image is considered an entity. The TallTales.dtd DTD in the previous example works in a similar way because it is referenced externally from talltales.xml.
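The idea that the same declarations could live directly inside the document type declaration can be sketched as follows. This is only an illustration: the actual contents of TallTales.dtd are not shown in this section, so the two element declarations and the tt element name are placeholder assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE talltales [
  <!-- Placeholder declarations standing in for the contents of TallTales.dtd -->
  <!ELEMENT talltales (tt)*>
  <!ELEMENT tt (#PCDATA)>
]>
<talltales>
  <tt>Placeholder tale text</tt>
</talltales>
```

Because the declarations are now stored inside the document, there is no separate DTD file and therefore no external DTD entity to reference.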

The external DTD is unusual among entities in that it is declared in the document type declaration. Most other entities are declared in an entity declaration, which must appear before the entities can be used in the document. An entity declaration consists of a unique entity name and a piece of data that is associated with the name. Following are some examples of data that you might reference as entities:

A string of text (a boilerplate copyright notice, for example)
A section of the DTD
An external file containing text data (a list of email addresses, for example)
An external file containing binary data (a GIF or JPEG image, for example)

This list reveals that entities are extremely flexible when it comes to the types of data that can be stored in them. Although the specific data within an entity can certainly vary widely, there are two basic types of entities that are used in XML documents: parsed entities and unparsed entities. Parsed entities store data that is parsed (processed) by an XML application, which means that parsed entities can contain only text. Unparsed entities aren't parsed and therefore can be either text or binary data. As an example, a text name would be a parsed entity whereas a JPEG image would be an unparsed entity.

Parsed entities end up being merged with the contents of a document when they are processed. In other words, parsed entities are directly inserted into documents as if they were directly part of the document to begin with. Unparsed entities cannot be handled in this manner because XML applications are unable to parse them. Going back to the JPEG image example, consider the difficulty of combining a binary JPEG image with the text content of an HTML document. Because binary data and text data don't mix well, unparsed entities are never merged directly with a document.

If an XML application can't process and merge an unparsed entity with the rest of a document, how does it use the entity as document data? The answer to this question lies in notations, which are XML constructs used to identify the entity type to an XML processor. In addition to identifying the type of an unparsed entity, a notation also specifies a helper application that can be used to process the entity. A good example of a helper application for an unparsed entity is an image viewer, which would be associated with a GIF or JPEG image entity or maybe a lesser-used image format such as TIFF. The point is that a notation tells an XML application how to handle an unparsed entity using a helper application.

If you've ever had your web browser prompt you for a plug-in to view a special content type such as an Adobe Acrobat file (.PDF document), you understand how important helper applications can be.

Parsed Entities

You learned that parsed entities are entities containing XML data that is processed by an XML application. There are two fundamental types of parsed entities:

General entities
Parameter entities

The next couple of sections explore these types of entities in more detail.

General Entities

General entities are parsed entities that are designed for use in document content. If you have a string of text that you'd like to isolate as a piece of reusable document data, a general entity is exactly what you need. A good example of such a reusable piece of text is the copyright notice for a web site, which appears the same across all pages. Before you can reference a general entity in a document, you must declare it using a general entity declaration, which takes the following form:

<!ENTITY EntityName EntityDef>

The unique name of the entity is specified in EntityName, whereas its associated text is specified in EntityDef. All entity declarations must be placed in the DTD, although you can decide whether they go in the internal or external DTD. If an entity is used only in a single document, you can place the declaration in the internal DTD; otherwise you'll want to place it in the external DTD so it can be shared. Of course, if you're using an existing XML language you may be forced to include your entity declarations in the internal DTD. Following is an example of a general entity declaration:

<!ENTITY copyright "Copyright &#169;2005 Test Name.">

Just in case you've already forgotten from earlier in the tutorial, the &#169; character reference in the code is the copyright symbol.

You are now free to use the entity anywhere in the content of a document. General entities are referenced in document content using the entity name sandwiched between an ampersand (&) and a semicolon (;), as the following form shows:

&EntityName;

Following is an example of referencing the copyright entity:

My Life Story.
&copyright;
My name is Test Name and this is my story.

In this example, the reference is replaced by the contents of the copyright entity at the point where it occurs, between the title and the sentence. The copyright entity is an example of a general entity that you declare yourself. This is how most entities are used in XML. However, there are a handful of predefined entities in XML that you can use without declaring them. I'm referring to the five predefined entities that correspond to special characters, which you learned about back in Creating XML Documents. Table 4.2 lists these entities just in case you don't quite remember them.

Table 4.2. Common Entities

Character	Entity
Less-than symbol (`<`)	`&lt;`
Greater-than symbol (`>`)	`&gt;`
Quote symbol (`"`)	`&quot;`
Apostrophe symbol (`'`)	`&apos;`
Ampersand symbol (`&`)	`&amp;`

These predefined entities serve as an exception to the rule of having to declare all entities before using them; beyond these five entities, all entities must be declared before being used in a document.
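To see the pieces of this section side by side, here is a minimal sketch of a complete document that declares the copyright entity in its internal DTD and then references it alongside a few predefined entities. The story, title, notice, and body element names are assumptions made up for this illustration; only the copyright declaration comes from the text above.

```xml
<?xml version="1.0"?>
<!DOCTYPE story [
  <!ELEMENT story (title, notice, body)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT notice (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!-- General entity declared by the author -->
  <!ENTITY copyright "Copyright &#169;2005 Test Name.">
]>
<story>
  <title>My Life Story.</title>
  <!-- Declared entity: replaced by the text from the declaration above -->
  <notice>&copyright;</notice>
  <!-- Predefined entities: usable without any declaration -->
  <body>My name is Test Name &amp; this is my story, told in &lt;story&gt; markup.</body>
</story>
```

An XML processor that reads this document should hand the application the notice text with the copyright symbol in place, and the body text with a literal ampersand and literal angle brackets.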

Parameter Entities

The other type of parsed entity supported in XML is the parameter entity, which is a parsed entity that is used only within a DTD. Parameter entities are used to help modularize the structure of DTDs by allowing you to store commonly used pieces of declarations. For example, you might use a parameter entity to store a list of commonly used attributes that are shared among multiple elements. As with general entities, you must declare a parameter entity before using it in a DTD. Parameter entity declarations have the following form:

<!ENTITY % EntityName EntityDef>

Parameter entity declarations are very similar to general entity declarations, with the only difference being the presence of the percent sign (%) and the space on either side of it. The unique name of the parameter entity is specified in EntityName, whereas the entity content is specified in EntityDef. Following is an example of a parameter entity declaration:

<!ENTITY % autoelems "year, make, model">

This parameter entity describes a portion of a content model that can be referenced within elements in a DTD. Keep in mind that parameter entities apply only to DTDs. Parameter entities are referenced using the entity name sandwiched between a percent sign (%) and a semicolon (;), as the following form shows:

%EntityName;

Following is an example of referencing the autoelems parameter entity in a hypothetical automotive DTD:

<!ELEMENT car (%autoelems;)>
<!ELEMENT truck (%autoelems;)>
<!ELEMENT suv (%autoelems;)>

This code is equivalent to the following:

<!ELEMENT car (year, make, model)>
<!ELEMENT truck (year, make, model)>
<!ELEMENT suv (year, make, model)>

It's important to note that parameter entities really come into play only when you have a large DTD with repeating declarations. Even then you should be careful how you modularize a DTD with parameter entities because it's possible to create unnecessary complexity if you layer too many parameter entities within each other.
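As a slightly fuller sketch, the following hypothetical external DTD uses one parameter entity for the shared content model and a second one for a shared attribute list, which is the other use mentioned earlier. The autos.dtd file name, the autos root element, and the id and color attributes are assumptions for illustration. Note that the XML specification allows a parameter entity reference inside a declaration, as in these examples, only in the external DTD, not in the internal DTD of a document.

```xml
<!-- autos.dtd: a hypothetical external DTD -->

<!-- Shared content model, as in the example above -->
<!ENTITY % autoelems "year, make, model">

<!-- Shared attribute list (attribute names are illustrative assumptions) -->
<!ENTITY % commonattrs
  "id     ID     #REQUIRED
   color  CDATA  #IMPLIED">

<!ELEMENT autos (car | truck | suv)*>

<!ELEMENT car (%autoelems;)>
<!ATTLIST car %commonattrs;>

<!ELEMENT truck (%autoelems;)>
<!ATTLIST truck %commonattrs;>

<!ELEMENT suv (%autoelems;)>
<!ATTLIST suv %commonattrs;>

<!ELEMENT year (#PCDATA)>
<!ELEMENT make (#PCDATA)>
<!ELEMENT model (#PCDATA)>
```

With this arrangement, adding an element or attribute that applies to all three vehicle types means editing a single entity declaration rather than three separate declarations.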

Unparsed Entities

Unparsed entities aren't processed by XML applications and are capable of storing text or binary data. Because it isn't possible to embed the content of binary entities directly in a document as text, binary entities are always referenced from an external location, such as a file. Unlike parsed entities, which can be referenced from just about anywhere in the content of a document, unparsed entities must be referenced using an attribute of type ENTITY or ENTITIES. Following is an example of an attribute list declaration that includes an attribute of type ENTITY:

<!ELEMENT player EMPTY>
<!ATTLIST player
  name CDATA #REQUIRED
  position CDATA #REQUIRED
  photo ENTITY #IMPLIED>

In this code, photo is specified as an attribute of type ENTITY, which means that you can assign an unparsed entity to the attribute. Following is an example of how this is carried out in document content:

<player name="Rolly Fingers" position="pitcher" photo="rfpic" />

In this code, the binary entity rfpic is assigned to the photo attribute; the attribute value must be the name of an unparsed entity that is declared in the DTD. Even when the binary entity has been properly declared and assigned, an XML application won't know how to handle it without a notation declaration, which you find out about a little later in the tutorial.

Internal Versus External Entities

The physical location of entities is very important in determining how entities are referenced in XML documents. Thus far I've made the distinction between parsed and unparsed entities. Another important way to look at entities is to consider how they are stored. Internal entities are stored within the document that references them and are parsed entities out of necessity. External entities are stored outside of the document that references them and can be either parsed or unparsed.

By definition, any entity that is not internal must be external. This means that an external entity is stored outside of the document where the entity is declared. A good example of an external entity is a binary image; images are always stored in separate files from the documents that reference them.

Unparsed external entities are often binary files such as images, which obviously cannot be parsed by an XML processor. Unparsed external entities are identified using the NDATA keyword in their entity declaration; NDATA (notation data) indicates that the entity's content is not XML data and is followed by the name of the notation that describes it.

External entity declarations are different from internal entity declarations because they must reference an external storage location. Files associated with external entities can be specified in one of two ways, depending on whether the file is located on the local file system or is publicly available on a network:

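SYSTEM: The file is identified by a system identifier, a URI that points to a location on the local file system or on a network
PUBLIC: The file is identified by a public identifier, a well-known name for a resource that is typically available in a publicly accessible place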

Watch Out!

When specifying the location of external entities, you must always supply a system identifier (a URI) that locates the file on a local system or network. A public identifier is optional: when you use the PUBLIC keyword, the public identifier comes first and the system identifier follows it, without the SYSTEM keyword.
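The two forms of external identifier can be sketched as follows for a pair of external parsed entities. Everything named here is hypothetical: the report element, the disclaimer.txt and legal.txt files, and the public identifier string are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE report [
  <!ELEMENT report (#PCDATA)>
  <!-- System identifier only: a URI naming the file -->
  <!ENTITY disclaimer SYSTEM "disclaimer.txt">
  <!-- Public identifier first, then the required system identifier -->
  <!ENTITY legal PUBLIC "-//Example Corp//TEXT Legal Notice//EN" "legal.txt">
]>
<report>&disclaimer; &legal;</report>
```

Because both entities are parsed, the referenced files must contain text that is legal as XML content (no stray < or & characters). Also keep in mind that a non-validating processor is not required to read external entities, so whether the text is actually merged in can depend on the parser and its settings.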

The file for an external entity is specified as a URI, which is very similar to the more familiar URL. You can specify files using a relative URI, which makes it a little easier than listing a full path to a file. XML expects relative URIs to be specified relative to the document within which an entity is declared. Following is an example of declaring a JPEG image entity using a relative URI:

<!ENTITY skate SYSTEM "pond.jpg" NDATA JPEG>

In this example, the pond.jpg file must be located on the local file system in the same directory as the file (document) containing the entity declaration. The NDATA keyword is used to indicate that the entity does not contain XML data. Also, the type of the external entity is specified as JPEG. Unfortunately, XML doesn't support any built-in binary entity types such as JPEG, even though JPEG is a widely known image format. You must use notations to establish entity types for unparsed entities.
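Pulling the unparsed entity pieces together, here is a minimal sketch of a complete document that declares a notation, declares an unparsed entity in the same style as the skate example, and assigns that entity to the photo attribute of the earlier player element. The roster root element, the rfpic.jpg file name, and the "image/jpeg" notation identifier are assumptions made for this illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE roster [
  <!ELEMENT roster (player)*>
  <!ELEMENT player EMPTY>
  <!ATTLIST player
    name     CDATA  #REQUIRED
    position CDATA  #REQUIRED
    photo    ENTITY #IMPLIED>

  <!-- Notation that gives the JPEG entity type a meaning; the identifier
       is an assumption chosen for illustration -->
  <!NOTATION JPEG SYSTEM "image/jpeg">

  <!-- Unparsed entity: the rfpic.jpg file name is hypothetical -->
  <!ENTITY rfpic SYSTEM "rfpic.jpg" NDATA JPEG>
]>
<roster>
  <player name="Rolly Fingers" position="pitcher" photo="rfpic" />
</roster>
```

The XML processor never opens rfpic.jpg itself. It only passes the entity's name, file location, and notation along to the application, which then decides how to display or otherwise handle the image.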