Web Crawling Basics : XML

A web crawler is an automated program that browses pages on the Web according to a certain algorithm. The simplest algorithm is to open and follow every link on a page, then open and follow every link on each of those pages, and so on. Web crawlers are typically used by search engines to index web pages for faster and more accurate searching. You can think of a web crawler as a little worker bee that is constantly out there buzzing from link to link on every web page, reporting information back to a database that is part of a search engine. Pretty much all major search engines use web crawlers of one form or another.
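To make the "follow every link" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular search engine works: the PAGES dictionary is a made-up stand-in for the Web so the example runs without a network connection, and the names LinkExtractor, extract_links, and crawl are my own. A real crawler would fetch pages over HTTP and would also need politeness rules, error handling, and limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch=PAGES.get):
    """Visit every page reachable from start_url, each one exactly once."""
    seen = {start_url}          # URLs already queued, so loops don't trap us
    queue = deque([start_url])  # pages waiting to be visited
    order = []                  # pages in the order they were crawled
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # unknown page: nothing to follow
            continue
        order.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    for url in crawl("http://example.com/"):
        print(url)
```

The seen set is the important design choice here: pages link back to each other all the time, and without it the crawler would circle between the home page and page A forever.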

The specific algorithms employed by web crawlers are often shrouded in secrecy, as search engine developers attempt to one-up each other in regard to accuracy and efficiency. So far Google appears to be out in front, at least when you consider how far-reaching its search results are compared to those of other search engines. Even so, this is an ongoing battle that will likely be waged for a long time, so the main players could certainly change over time. In fact, Microsoft recently entered the fray with its own MSN Search service.

Getting back to web crawlers, they are important to web developers because you actually want public sites to be crawled as frequently as possible to ensure that search engines factor in your most recent web content. Up to this point, you pretty much had to cross your fingers and hope for the best when it came to a search engine's web crawler eventually getting around to crawling your site again and updating its indexes accordingly. Keep in mind that I'm not talking about your site's search ranking per se; I'm talking more about how a search engine goes about crawling your site. Of course, the content within your web pages certainly affects the site's rankings, but having your site crawled more regularly doesn't necessarily have anything to do with a ranking improvement.

To understand the relationship between a web crawler and a page's ranking, consider a blog page where you discuss the inner workings of a diesel engine in detail. When this page is crawled by a search engine, it may move up the rankings for searches related to engines. Now let's say a week later your topic of choice is movies by Steven Spielberg. If the page is crawled again, the content will dictate that it isn't such a good match for engine searches, and will instead be realigned to match up with Steven Spielberg movie searches. Of course, it's also possible that both blog entries are still on the page, in which case you may dramatically move up the rankings for searches involving Steven Spielberg's first film, Duel, which featured a renegade 18-wheeler powered by a diesel engine.
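The blog scenario above can be sketched with a toy index in Python. This is purely an assumption-laden illustration: the TinyIndex class, its method names, and the "blog" URL are all made up, and it only models which searches a page matches, not how highly it ranks. Real ranking algorithms are far more involved and, as mentioned earlier, not public.

```python
from collections import defaultdict


class TinyIndex:
    """A toy search index: remembers which words appear on which page."""

    def __init__(self):
        self.words_by_url = {}                # url -> words seen at last crawl
        self.urls_by_word = defaultdict(set)  # word -> urls containing it

    def crawl(self, url, text):
        """(Re)index a page; whatever was indexed for it before is replaced."""
        for word in self.words_by_url.get(url, set()):
            self.urls_by_word[word].discard(url)
        words = set(text.lower().split())
        self.words_by_url[url] = words
        for word in words:
            self.urls_by_word[word].add(url)

    def search(self, word):
        """Return the pages that contained word the last time they were crawled."""
        return sorted(self.urls_by_word.get(word.lower(), set()))


if __name__ == "__main__":
    index = TinyIndex()

    # Week 1: the blog discusses diesel engines.
    index.crawl("blog", "the inner workings of a diesel engine")
    print(index.search("diesel"))     # the blog matches engine searches

    # Week 2: new topic, and the page gets crawled again.
    index.crawl("blog", "movies by Steven Spielberg")
    print(index.search("diesel"))     # no longer a match
    print(index.search("spielberg"))  # now matches Spielberg searches
```

Notice that until the second crawl happens, the index still believes the page is about diesel engines. That lag is exactly why you want your site crawled often.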

This example is admittedly simplified, but hopefully you get the idea regarding web crawling and search engine results. The main point to take away: more frequent crawling of your web pages results in more accurate search results for them, but not necessarily any improvement in search rankings. Incidentally, if I had a magic XML bullet for improving search rankings, you'd be watching me on an infomercial instead of reading this tutorial!