
Inside the Google Sitemap Protocol

Enough background information; let's look at some XML code! The language behind Google Sitemaps is an XML-based language called the Sitemap protocol. The Sitemap protocol is very simple and consists of only six different tags. Following is a list of these tags and their meanings in the context of a Sitemap document:

  • <urlset> The root element of a Sitemap document, which serves as a container for individual <url> elements

  • <url> The storage unit for an individual URL within a Sitemap; serves as a container for the <loc>, <lastmod>, <changefreq>, and <priority> elements

  • <loc> The URL of a discrete page within a Sitemap; this tag is required

  • <lastmod> The date/time of the last change to this web page; this tag is optional

  • <changefreq> An estimate of how frequently the content in the web page changes; this tag is optional

  • <priority> The priority of the web page with respect to other pages on this site; this tag is optional

These tags match up very closely with the Sitemap information covered in the previous section. In other words, they shouldn't come as much of a surprise given that you already know each URL in a Sitemap is described by a location, last modification date, change frequency, and priority ranking. However, it helps to see the tags in context to get a better feel for how a Sitemap is structured. Following is the code for a minimal Sitemap with one URL entry:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
  http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
  <url>
    <loc>http://www.xyz.com/</loc>
    <lastmod>2005-08-23</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

The messiest part of this code is the <urlset> tag, which includes a couple of namespace declarations, as well as a reference to a schema for the document; more on Sitemap schemas later in the section titled, "Validating Your Sitemap." As you can see, the <urlset> tag serves as the root of the document, and therefore contains all of the other tags in the document. The Sitemap namespace is first declared in the <urlset> tag, followed by the XML Schema instance namespace (the xsi prefix) and then the location of the Sitemap schema itself. All of this namespace/schema stuff is boilerplate code that will appear in every Sitemap document. Of course, if you don't feel the need to validate your Sitemap, you can leave out the schema code.

Within the <urlset> tag, the real work starts to take place. In this example there is only one URL, as indicated by the solitary <url> tag. The <url> tag is used to house the child tags that describe each URL in your web site. These child tags are <loc>, <lastmod>, <changefreq>, and <priority>, and together they describe a single URL for a Sitemap. Let's take a quick look at the details of each tag and how they are used.

The <loc> tag represents the location of a web page in the form of a URL. This URL must be a full URL complete with http:// at the start, or https:// in the case of secure pages. If your web server requires a trailing slash on domain paths, make sure to include it here, as in http://www.xyz.com/. Also keep in mind that the contents of the <loc> tag can't be longer than 2,048 characters. That would be a ridiculously lengthy URL, so I doubt you will encounter a problem with this upper limit on URL length. The only other thing to note about the <loc> tag is that it is the only required child tag of <url>, which means you can feasibly create a Sitemap with nothing more than the <urlset>, <url>, and <loc> tags. But why would you want to do that when you can use the other optional tags to give Google even more assistance in crawling your pages?
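To make the point about required tags concrete, here is a rough sketch of a bare-bones Sitemap that uses nothing but <urlset>, <url>, and <loc>, with the optional schema reference left out. The second URL and its path are made up purely for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Bare-bones sketch: no schema reference and no optional child tags -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <!-- <loc> is the only required child of <url> -->
    <loc>http://www.xyz.com/</loc>
  </url>
  <url>
    <!-- Hypothetical secure page; note the full https:// prefix -->
    <loc>https://www.xyz.com/checkout/</loc>
  </url>
</urlset>
```

A Sitemap like this tells the crawler only where your pages are, which is why the optional tags are usually worth the extra effort.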

Each Sitemap is limited to 50,000 URLs and a file size of 10MB when uncompressed. Google Sitemaps also accepts Sitemap files that have been compressed with the gzip tool, which is freely available at http://www.gzip.org/.

The <lastmod> tag is used to identify the last time a page was modified, which lets the web crawler decide whether it needs to reindex the page by comparing this date to the date of the version it last indexed. The content in the <lastmod> tag must be formatted as an ISO 8601 date/time value. In practical terms, this means you can code it simply as a date with the following format: YYYY-MM-DD. In other words, you can omit the time if you want. If you do elect to include the time, the entire date/time representation is typically expressed as YYYY-MM-DDThh:mm:ss followed by a time zone. The letter T in the middle of the date/time is just a separator. Of course, you may already be asking yourself how the time zone factors into this format. You add the time zone onto the end of the time as an offset relative to UTC (Coordinated Universal Time) in the form +hh:mm or -hh:mm, where the minutes come into play if you're dealing with a half-hour time zone; for UTC itself you can simply append the letter Z. Following is an example of a complete date/time in the central time zone (CST), which is UTC minus six hours:

2005-10-31T15:43:22-06:00

In this example, the date is October 31, 2005, and the time is 3:43:22 p.m. CST. Notice that 24-hour time is specified in order to make the distinction between a.m. and p.m.
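To see how these date/time forms look inside a Sitemap, consider the following sketch, which shows a date-only value, a UTC value ending in Z, and a value with a half-hour offset. The page paths and timestamps are hypothetical and exist only to show the formats:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.xyz.com/</loc>
    <!-- Date only: YYYY-MM-DD -->
    <lastmod>2005-08-23</lastmod>
  </url>
  <url>
    <!-- Hypothetical page -->
    <loc>http://www.xyz.com/news/</loc>
    <!-- Date and time in UTC, marked with a trailing Z -->
    <lastmod>2005-10-31T21:43:22Z</lastmod>
  </url>
  <url>
    <!-- Hypothetical page -->
    <loc>http://www.xyz.com/contact/</loc>
    <!-- Date and time with a half-hour offset from UTC -->
    <lastmod>2005-10-31T09:15:00+05:30</lastmod>
  </url>
</urlset>
```

It's perfectly fine to mix these forms within a single Sitemap, so use whichever level of precision you actually have for each page.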

ISO 8601 is an international date/time standard for representing a date and time as plain text. ISO 8601 supports a wider range of date/time formats than what I've explained here. If you feel the urge to know more, the W3C note describing the subset of ISO 8601 covered here is available online: http://www.w3.org/TR/NOTE-datetime.

The <changefreq> tag is used to specify how often the content on a web page changes. The change frequency of a page is obviously something that can't always be predicted and in many cases varies considerably. For this reason, you should think of the <changefreq> tag as providing a web crawler with a rough estimate of how often a page changes. Google makes no promises regarding how often it will crawl a page even if you set the change frequency to a very frequent value, so your best bet is to try to be realistic when determining the change frequency of your pages. Possible values for this tag are always, hourly, daily, weekly, monthly, yearly, and never. The always value should only be used on pages that literally change every time they are viewed, while never is reserved for pages that are completely and permanently static (typically archived pages). The remaining values provide plenty of options for specifying how frequently a page changes.
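As a rough illustration of how you might mix change frequencies across a site, the following sketch assigns different <changefreq> values to three pages. The paths, and my guesses about how often such pages change, are assumptions made up for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <!-- Home page with content that is refreshed every day -->
    <loc>http://www.xyz.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <!-- Hypothetical product listing that is updated about once a month -->
    <loc>http://www.xyz.com/products/</loc>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <!-- Hypothetical archived page that is permanently static -->
    <loc>http://www.xyz.com/archive/2004/</loc>
    <changefreq>never</changefreq>
  </url>
</urlset>
```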

The <priority> tag allows you to assign relative priorities to the pages on your web site. Although I mentioned it earlier, it's worth hammering home once more that this tag has nothing to do with a page's priority level as compared to other web sites, so please don't think of it as a way to boost your site as a whole. The significance of a priority ranking in this case is to help identify URLs on your web site that are more important than other URLs on your web site. In theory, this may help a web crawler isolate the most important pages on your site when targeting search results. Values for the <priority> tag range from 0.0 to 1.0, with 1.0 being the highest priority and 0.0 being the lowest. Generally speaking, you should rank average pages as 0.5, the most important pages as 1.0, and the least important pages as 0.0; this tag defaults to 0.5 if you don't specify it. Feel free to use values in between those I just suggested if you think you can assess the relative importance of pages on your site to that degree.

It will do you no good to set a high priority for all of your pages, as the end result will provide Google with no basis for determining which of your pages you think are more important than others.
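Here is one way the priority guidelines might play out in practice. This is only a sketch: the paths are hypothetical, and the specific priority values are my own assumptions about which pages matter most on an imaginary site:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <!-- Most important page on the site -->
    <loc>http://www.xyz.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <!-- Hypothetical page of above-average importance -->
    <loc>http://www.xyz.com/products/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <!-- Hypothetical average page; <priority> is omitted, so it defaults to 0.5 -->
    <loc>http://www.xyz.com/about/</loc>
  </url>
  <url>
    <!-- Hypothetical page of little importance -->
    <loc>http://www.xyz.com/legal/</loc>
    <priority>0.1</priority>
  </url>
</urlset>
```

Notice that the values are spread out rather than clustered near 1.0, which is what gives the crawler something meaningful to compare.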

The restrictions on Sitemap files are modest. URLs must not include embedded newlines, and you must fully specify each URL because Google tries to crawl URLs exactly as you provide them. Your Sitemap files must use UTF-8 encoding. Finally, each Sitemap file is limited to 50,000 URLs and 10MB when uncompressed.