CGI and Perl

On-Site Searching with Glimpse

Glimpse is a set of UNIX tools that provides an excellent foundation for a search engine on any UNIX-based Web server. Glimpse (GLobal IMPlicit SEarch) is an "indexing and query system" that lets you search through large numbers of files on your server very quickly. It is used in much the same way as the popular UNIX command grep, except that it can search entire filesystems. For example, if you are looking for the word "help" in a file located anywhere on your server, all you have to do is type "glimpse help", and every line containing "help" appears, preceded by the name of the file it was found in.
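As a rough illustration of this grep-like usage, the following shell sketch wraps a few common searches. The function names and the GLIMPSE variable are hypothetical conveniences, not part of the Glimpse package. The -i, -w, and -l options have the same meaning as in grep. The sketch assumes an index has already been built where Glimpse expects to find it.

```shell
# Name of the glimpse program; override it to try the sketch without Glimpse installed.
GLIMPSE="${GLIMPSE:-glimpse}"

# Print every matching line, preceded by the name of its file.
find_lines() { "$GLIMPSE" "$1"; }

# Same search, ignoring case (-i) and matching whole words only (-w).
find_words() { "$GLIMPSE" -i -w "$1"; }

# List only the names of the files that contain a match (-l).
find_files() { "$GLIMPSE" -l "$1"; }
```

With these functions defined, find_lines help runs the same search as typing "glimpse help" directly.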

Glimpse was developed by Udi Manber and Burra Gopal, at the University of Arizona, and Sun Wu, at the National Chung-Cheng University in Taiwan. At the time of this writing, Glimpse is at version 4.0. Source and precompiled binaries of Glimpse can be found at: http://glimpse.cs.arizona.edu/

The Glimpse package contains the programs agrep, glimpse, glimpseindex, and glimpseserver. Before you can use Glimpse from the command line, you must "index" your files with glimpseindex. The glimpseindex program creates an optimized index file that contains a "hash" of all the data in your files, and Glimpse searches this index file instead of the actual data. For that reason, it is important to keep the index file up to date. Running glimpseindex nightly from cron, a utility that executes tasks on a regular schedule, is typically a good idea. Creating an index with glimpseindex is very simple. To index all files in the directory tree rooted at /public_html, type (or place in your crontab) the following:

glimpseindex /public_html

Afterwards, Glimpse can quickly and efficiently search through all of the documents indexed in the /public_html directory.
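The nightly rebuild suggested above could be wrapped in a small script along the following lines. This is only a sketch: the index directory /var/glimpse, the script location, and the 02:30 schedule are assumptions chosen for illustration. The -H option of glimpseindex names the directory in which the index files are written.

```shell
#!/bin/sh
# Nightly reindex sketch. INDEX_DIR and the script path are hypothetical;
# DOC_ROOT matches the /public_html example in the text.
DOC_ROOT="${DOC_ROOT:-/public_html}"
INDEX_DIR="${INDEX_DIR:-/var/glimpse}"
GLIMPSEINDEX="${GLIMPSEINDEX:-glimpseindex}"

reindex() {
    # -H tells glimpseindex which directory to write its index files into.
    "$GLIMPSEINDEX" -H "$INDEX_DIR" "$DOC_ROOT"
}

# Rebuild only when asked to, so the file can also be sourced for its function.
case "${1:-}" in
    run) reindex ;;
esac

# Example crontab entry: rebuild the index at 02:30 every night.
# 30 2 * * * /usr/local/bin/reindex.sh run >/dev/null 2>&1
```

If the index is kept in its own directory like this, searches need the matching option, for example glimpse -H /var/glimpse help.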

TIP:

Pay close attention to what you are indexing. If you want to index all of the Web pages on your server, your index need only cover the files under the public HTML directory. Images located in the public HTML document area need not be indexed, so place them in a directory that Glimpse does not index.
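One way to act on this tip is to list the image files that currently sit inside the tree you plan to index, so that you know what to move. The sketch below uses only find and sort. The function name and the list of extensions are assumptions, and -iname (case-insensitive matching) is available in GNU and BSD find but is not required by POSIX.

```shell
# List image files under the directory given as $1, one path per line.
list_images() {
    find "$1" -type f \( -iname '*.gif' -o -iname '*.jpg' \
        -o -iname '*.jpeg' -o -iname '*.png' \) | sort
}
```

Running list_images /public_html before building the index shows which files could be relocated to a separate, unindexed directory.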

Glimpse Indexes

Glimpse indexes are highly optimized files containing representations of the actual data on your system. By searching for patterns in these index files, Glimpse can quickly query large amounts of data. Glimpse supports three types of indexes: a tiny one (2 to 3 percent of the size of all files indexed), a small one (7 to 9 percent), and a medium one (20 to 30 percent). You choose the relative size of the index when you build it with glimpseindex. The larger the index, the faster the search. Base the size of the index on the resources available on your server. If you have a fast server (say, a Silicon Graphics WebForce server) with limited disk resources, you will probably want a smaller index. If you have lots of disk space and a slower Intel-based server, consider a bigger index. Glimpse also supports "approximate matching" (finding misspelled words), Boolean queries, and limited forms of regular expressions. Details can be found in the Glimpse man pages or on its Web site.
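To make the space trade-off concrete, the percentages quoted above can be turned into a quick estimate. The following sketch takes the total size of the files to be indexed, in kilobytes, and prints the expected size range of each index type. The function name is hypothetical and the results are only as precise as the quoted percentages. In glimpseindex, the tiny index is the default, -o requests the small index, and -b requests the medium one.

```shell
# Print the approximate index size range, in KB, for each index type.
# $1 is the total size of the files to be indexed, in KB.
estimate_index_kb() {
    printf 'tiny %d-%d KB\n'   $(( $1 * 2 / 100 ))  $(( $1 * 3 / 100 ))    # default
    printf 'small %d-%d KB\n'  $(( $1 * 7 / 100 ))  $(( $1 * 9 / 100 ))    # glimpseindex -o
    printf 'medium %d-%d KB\n' $(( $1 * 20 / 100 )) $(( $1 * 30 / 100 ))   # glimpseindex -b
}

# Example: estimate for a real tree, using du to measure it in KB.
# estimate_index_kb "$(du -sk /public_html | cut -f1)"
```

For a document tree of 100000 KB, this reports 2000-3000 KB for the tiny index, 7000-9000 KB for the small index, and 20000-30000 KB for the medium index.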