As you have seen in this chapter, the LWP::UserAgent module greatly simplifies writing user agents that automate operations involving connections to Web servers. It is important to note, however, that the examples you have seen here work only with HTML documents. As Web content grows richer and comes to include non-text-based document formats (such as PDF), it will become more important to add more advanced indexing capability by leveraging work that has already been done using Perl5.
Search Engines
- On-Site Searching with Glimpse
- Search the Web with WWW::Search
There are two basic ways we access information on the Web: browsing and searching. The Web's popularity and power are based on its vast number of hyperlinked documents. You can browse from one page to another, clicking on the links that interest you or that lead toward what you are looking for. Starting from a single home page, or a page such as Yahoo!, you can click your way to anywhere else on the Web.
However, as more and more information becomes available on the Web, even the best indexes can't provide links to all of it. With tens of millions of Web pages currently on servers all over the world, it is simply impractical to "browse" through an index of these documents to find the information you are looking for.
So, as the Web has expanded, we have seen the birth of search engines. At first, these search engines could be found on the more prominent index sites such as Yahoo!. The search engine could locate a list of Web sites that matched the given search criteria. Today, Web sites such as Digital Equipment Corporation's AltaVista allow searching of the entire Web using giant supercomputers with gigabytes of memory.
When implementing a search engine on your site, consider how it works from the user's standpoint. Many of the search functions I find on the Web today are totally useless because of the way their interfaces were designed. The typical user does not want to take the time to learn the syntax of a complicated "valid" search query and is easily annoyed by the "black box" nature of some search mechanisms. This is especially true if the search mechanism fails to return an appropriate response, or any response at all, to the user.
In this chapter, I will show you how Perl5 can be used to access information locally on your site and globally on any site on the Web. If implemented properly, these tools will allow even a poorly constructed or misspelled search query to return appropriate information to those searching your site.