DTD Construction Basics : XML

<?xml version="1.0"?>
<!DOCTYPE talltales SYSTEM "talltales.dtd">

The second line in this code is the document type declaration for the Tall Tales document you saw earlier. The main thing to note about this code is how the Tall Tales DTD is specified.

The terminology surrounding DTDs and document type declarations is admittedly confusing, so allow me to clarify that DTD stands for document type definition and contains the actual description of a markup language. A document type declaration is a line of code in a document that identifies the DTD. So, the big distinction here is that the definition (DTD) describes your markup language, whereas the declaration associates it with a document. Got it? Let's move on!

A DTD describes vital information about the structure of a document using markup declarations, which are lines of code that describe elements, attributes, and the relationship between them. The following types of markup declarations may be used within a DTD:

The elements allowed in the document
The attributes that may be assigned to each element
Entities that are allowed in the document
Notations that are allowed for use with external entities
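
As a rough illustration of these four kinds of markup declarations, here is a minimal sketch of a document with an internal DTD. The tt and picture elements, the answer and src attributes, the publisher and cover entities, and the file cover.gif are all hypothetical names chosen for this sketch, not part of the actual Tall Tales DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE talltales [
  <!-- Element declarations: which elements may appear and what they contain -->
  <!ELEMENT talltales (tt+)>
  <!ELEMENT tt (question, picture?)>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT picture EMPTY>

  <!-- Attribute declarations: which attributes each element may carry -->
  <!ATTLIST tt answer (a | b) #REQUIRED>
  <!ATTLIST picture src ENTITY #REQUIRED>

  <!-- Notation declaration: names a non-XML data format (hypothetical) -->
  <!NOTATION gif SYSTEM "image/gif">

  <!-- Entity declarations: a text entity and an unparsed external entity (hypothetical) -->
  <!ENTITY publisher "Tall Tales Trivia">
  <!ENTITY cover SYSTEM "cover.gif" NDATA gif>
]>
<talltales>
  <tt answer="a">
    <question>Who publishes &publisher;?</question>
    <picture src="cover"/>
  </tt>
</talltales>
```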

You already know about elements and attributes, but the last two kinds of markup declarations are entirely new territory. Don't worry; you'll learn more about entities and notations in the next tutorial, "Digging Deeper into XML Documents." For now, it's important to understand that the markup declarations within a DTD serve a vital role in allowing documents to be validated against the DTD.

When associating a DTD with a document, there are two approaches the document type declaration can take:

It can directly include markup declarations in the document that form the internal DTD.
It can reference external markup declarations that form the external DTD.

These two approaches to declaring a DTD reveal that there are two parts to a DTD: an internal part and an external part. When you refer to the DTD for a document, you are actually referring to the internal and external DTDs taken together. The reason for breaking the DTD into two parts has to do with flexibility. The external DTD typically describes the general structure for a class of documents, whereas the internal DTD is specific to a given document. XML gives preference to the internal DTD: it is read before the external DTD, and when an entity or attribute is declared more than once the first declaration wins. This means you can use the internal DTD to override those declarations in the external DTD.
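
To make the two parts concrete, here is a minimal sketch of how they might be split. The contents of talltales.dtd shown here are an assumption made for illustration (the tt element and its answer attribute are hypothetical), as is the author entity.

```xml
<!-- talltales.dtd (hypothetical contents): the external DTD shared by a class of documents -->
<!ELEMENT talltales (tt+)>
<!ELEMENT tt (question)>
<!ELEMENT question (#PCDATA)>
<!ATTLIST tt answer (a | b) #REQUIRED>
```

A document of this type could then reference the external DTD and add a declaration of its own in the internal DTD:

```xml
<?xml version="1.0"?>
<!DOCTYPE talltales SYSTEM "talltales.dtd" [
  <!-- Internal DTD: a declaration that matters only to this one document (hypothetical) -->
  <!ENTITY author "A. Writer">
]>
<talltales>
  <tt answer="b">
    <question>Did &author; write this question?</question>
  </tt>
</talltales>
```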

By the Way

If you have any experience with CSS (Cascading Style Sheets) in web design, you may recognize a similar structure in DTDs where the internal DTD overrides the external DTD. In CSS style sheets, local styles override any external style sheets.

Breaking Down a DTD

The following code shows the general syntax of a document type declaration:

<!DOCTYPE RootElem SYSTEM ExternalDTDRef [InternalDTDDecl]>

By the Way

You could argue that it isn't necessary to understand the inner workings of DTDs in order to use XML, and to some extent that is true. In fact, you don't necessarily have to know anything about schemas to do interesting things with XML. However, it's impossible to truly understand the XML technology without having a firm grasp on what constitutes an XML-based markup language. And, of course, XML-based markup languages are described using DTDs and other types of schemas.

The external DTD is referenced by ExternalDTDRef, which is the quoted URI (Uniform Resource Identifier) of a file containing the external DTD. The internal DTD corresponds to InternalDTDDecl and is declared between square brackets ([]). In addition to the internal and external DTDs, another very important piece of information is given in the document type declaration: the root element. RootElem names the root element that every document of this type must use. The word SYSTEM indicates that the DTD is located in an external file. Following is an example of a document type declaration that uses both an internal and external DTD:

<!DOCTYPE talltales SYSTEM "TallTales.dtd" [
<!ELEMENT question (#PCDATA)> ]>

By the Way

A URI (Uniform Resource Identifier) is a more general form of a URL (Uniform Resource Locator) that allows you to identify network resources other than files. URLs should be familiar to you as they are used to represent the addresses of web pages.

This code shows how you might create a document type declaration for the Tall Tales trivia sample document. The root element of the document is talltales, which means that all documents of this type must have their content housed within the talltales element. The document type declaration references an external DTD stored in the file TallTales.dtd. Additionally, an element named question is declared as part of the internal DTD. Remember that internal declarations take precedence over external declarations of the same name, but this applies only to entities and attributes. An element may be declared only once across the whole DTD, so this example is valid only if TallTales.dtd does not declare question itself. It isn't always necessary to use an internal DTD; the external DTD often describes a language sufficiently on its own.
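
The precedence of the internal DTD can be sketched with an attribute default and an entity. Everything in this sketch is an assumption made for illustration: the contents of TallTales.dtd, the tt element, the answer attribute, and the edition entity are hypothetical.

```xml
<!-- TallTales.dtd (hypothetical contents) -->
<!ELEMENT talltales (tt+)>
<!ELEMENT tt (question)>
<!ELEMENT question (#PCDATA)>
<!ATTLIST tt answer (a | b) "a">
<!ENTITY edition "Standard Edition">
```

```xml
<?xml version="1.0"?>
<!DOCTYPE talltales SYSTEM "TallTales.dtd" [
  <!-- Read first, so these win over the declarations of the same name in TallTales.dtd -->
  <!ATTLIST tt answer (a | b) "b">
  <!ENTITY edition "Local Edition">
]>
<talltales>
  <!-- answer is omitted, so its default comes from the internal declaration -->
  <tt>
    <question>Is this the &edition;?</question>
  </tt>
</talltales>
```

Under these assumptions, the answer attribute defaults to b and the edition entity expands to the internal replacement text. The internal DTD does not redeclare talltales, tt, or question, because an element may be declared only once.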

By the Way

The document type declaration must appear after the XML declaration but before the first element (tag) in a document.

In the previous tutorial you learned about the XML declaration, which must appear at the beginning of a document and indicates what version of XML is being used. The XML declaration can also contain two additional pieces of information: the standalone status and the character encoding of a document. Of the two, the standalone status is the one that relates to DTDs. It determines whether or not a document relies on any external information sources, such as an external DTD. You can explicitly set the standalone status of a document using the standalone document declaration, which looks like an attribute of the XML declaration:

<?xml version="1.0" standalone="no"?>

By the Way

You learn about the character encoding of a document in the next tutorial, "Digging Deeper into XML Documents."

A value of yes for standalone indicates that a document is standalone and therefore doesn't rely on external information sources. A value of no indicates that a document is not standalone and therefore may rely on external information sources. More precisely, a document can't be considered standalone if declarations in an external DTD affect its content, for example by supplying default attribute values or by declaring entities that the document references; such a document must have standalone set to no. When a document has external markup declarations but no standalone document declaration, no is assumed.
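
The two settings might look like this in practice. This is only a sketch: the element content is made up, and the second document assumes a hypothetical talltales.dtd that declares a tt element with a default value for an answer attribute.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- Standalone: the only markup declarations are in the internal DTD -->
<!DOCTYPE talltales [
  <!ELEMENT talltales (#PCDATA)>
]>
<talltales>Everything needed to read this document is in the document itself.</talltales>
```

```xml
<?xml version="1.0" standalone="no"?>
<!-- Not standalone: the (hypothetical) external DTD supplies the default for answer -->
<!DOCTYPE talltales SYSTEM "talltales.dtd">
<talltales>
  <tt>
    <question>Which answer applies when none is given?</question>
  </tt>
</talltales>
```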

Pondering Elements and Attributes

The primary components described in a DTD are elements and attributes. Elements and attributes are very important because they establish the logical structure of XML documents. You can think of an element as a logical unit of information, whereas an attribute is a characteristic of that information. This is an important distinction because there is often confusion over when to model information using an element versus using an attribute.

A useful approach to take when assessing the best way to model information is to consider the type of the information and how it will be used. Attributes provide tighter constraints on information, which can be very helpful. More specifically, attributes can be constrained against a predefined list of possible values and can also have default values. Element content is very unconstrained and is better suited for storing long strings of text and other child elements. Consider the following list of advantages that attributes offer over elements:

Attributes can be constrained against a predefined list of enumerated values.
Attributes can have default values.
Attributes have data types, although admittedly somewhat limited.
Attributes are very concise.

Attributes don't solve every problem, however. In fact, they are limited in several key respects. Following are the major disadvantages associated with attributes:

Attributes are poorly suited to storing long strings of text.
Attributes can't contain nested information.
Whitespace can't be ignored in an attribute value.

Given that attributes are simpler and more concise than elements, it's reasonable that you should use attributes over child elements whenever possible. Fortunately, the decision to use child elements is made fairly straightforward by the limitations of attributes: if a piece of information is a long string of text, requires nested information within it, or requires whitespace to be ignored, you'll want to place it in an element. Otherwise, an attribute is probably your best choice. Of course, regardless of how well your document data maps to attributes, it must have at least a root element.
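
As a minimal sketch of this guideline, the following document stores a short value from a fixed list in an attribute and free-form text in an element. The tt element and answer attribute are hypothetical names used only for illustration, and the question text is placeholder content.

```xml
<?xml version="1.0"?>
<!DOCTYPE talltales [
  <!ELEMENT talltales (tt+)>
  <!ELEMENT tt (question)>
  <!-- Long, free-form text fits naturally in element content -->
  <!ELEMENT question (#PCDATA)>
  <!-- A short value from a fixed list fits an attribute: enumerated, with a default -->
  <!ATTLIST tt answer (a | b) "a">
]>
<talltales>
  <tt answer="b">
    <question>A long question can run to several sentences here without any trouble.</question>
  </tt>
  <!-- answer is omitted here, so the declared default value a applies -->
  <tt>
    <question>Another long question goes here.</question>
  </tt>
</talltales>
```

Modeling answer as a child element instead would give up both the enumerated list and the default value, since element content can't be constrained that way in a DTD.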