CGI and Perl

The Power of Perl in Text File Processing

Now that you have the front end, it's time to write the search engine itself. Use the File::Find module, which is included in the standard Perl distribution. This module does all of the directory traversal for you, leaving you to implement only the search algorithm. The algorithm searches each file for every word in the query, keeping a count of how many times each word is found. When it comes time to display the search results, you can present them in descending order of hit count, which puts the page the user is most likely looking for right at the top. This concept should be familiar if you have visited any of the popular search sites on the Web.

Assuming you have already extracted the list of words to search for, you simply write a function that scans a given file for those words, and leave it to the File::Find module to hand you each file in turn, as shown in Listing 7.9.

Listing 7.9. Subroutine to search for a list of words.

sub wanted {
    # Skip UNIX-style hidden files and directories.
    return if $File::Find::name=~/\/\./;
    # Only look at HTML files.
    if ($File::Find::name=~/\.html$/) {
       if (!open(IN, "< $File::Find::name")) {
          # This error message will appear in your error_log file.
          warn "Cannot open file: $File::Find::name...$!\n";
          return;
       }
       my(@lines)=<IN>;
       close(IN);
       my($count)=0;
       foreach (@words) {
          # Count the lines containing this word. \Q...\E quotes
          # any regex metacharacters in the user's input, and /i
          # makes the search case-insensitive.
          $count+=grep(/\Q$_\E/i,@lines);
       }
       if ($count>0) {
          # Add this page to the list of found items.
          push(@foundList,$File::Find::name);
          # Store the hit count in an associative array
          # with the page as the key.
          $hitCounts{$File::Find::name}=$count;
       }
    }
 }

Note:

If you are running on a UNIX system where the egrep command is available, you should consider replacing the majority of this Perl code with a call to egrep, as follows:

@hitList=`egrep -ci '(word1|word2|word3)' $File::Find::name`;

This would be more efficient in terms of both memory and processor use, because the file's contents never have to be read into a Perl array.
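
As a rough sketch, the body of wanted() might then shrink to something like the following (assuming a UNIX host with egrep on the path, and the same global @words, @foundList, and %hitCounts as before; note that egrep -c counts each matching line once, rather than once per word):

sub wanted {
    return if $File::Find::name=~/\/\./;
    return unless $File::Find::name=~/\.html$/;
    # Build an alternation such as '(word1|word2|word3)'
    # from the global @words list.
    my($pattern)=join('|', map { quotemeta } @words);
    # egrep -c prints the number of matching lines; -i makes
    # the match case-insensitive.
    my($count)=`egrep -ci '($pattern)' $File::Find::name`;
    chomp($count);
    if ($count>0) {
       push(@foundList,$File::Find::name);
       $hitCounts{$File::Find::name}=$count;
    }
}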


File::Find provides a function called finddepth(), which takes at least two arguments: a reference to a filter function and one or more directory names to recurse into. The filter function used here is the wanted() subroutine shown above; finddepth() calls wanted() for each file and directory it comes across. Within the filter, the filename is contained in the variable $_, and the directory path is contained in $File::Find::dir. The code above uses $File::Find::name, which is the combination of the other two, joined by a path separator. By using the functionality provided by File::Find, all you need to do is plug in your search filter, without worrying about recursion or figuring out what's a file and what's a directory.
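
If the relationship among these variables is unclear, this small sketch (using a hypothetical /home/docs directory) simply prints all three for every entry visited:

use File::Find;

sub show {
    # For a file such as /home/docs/sub/index.html:
    #   $_                is "index.html"
    #   $File::Find::dir  is "/home/docs/sub"
    #   $File::Find::name is "/home/docs/sub/index.html"
    print "$File::Find::dir + $_ = $File::Find::name\n";
}

finddepth(\&show,"/home/docs");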

The code used to initiate the search looks like this:

@words=split(' ',$q->param('SearchString'));
if (@words>0) {
    finddepth(\&wanted,"/user/bdeng/Web/docs");
}
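
Once finddepth() returns, @foundList holds the matching pages and %hitCounts their scores. Here is a minimal sketch of the ranked display described earlier, assuming the same CGI object $q used by the front end (turning each file path into a clickable URL is site-specific and omitted here):

# Sort the found pages by descending hit count, so the most
# likely match appears first.
foreach my $page (sort { $hitCounts{$b} <=> $hitCounts{$a} } @foundList) {
    print $q->p("$page ($hitCounts{$page} hits)");
}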

It's probably a good idea to check that the @words array contains at least one value before calling finddepth(); there's no need to make it do all that work if you have nothing to search for. In that case, you might emit some HTML that politely reminds the user to specify something to search for.
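
A minimal sketch of that check, extending the initiation code above with an else branch (again assuming the CGI object $q):

if (@words>0) {
    finddepth(\&wanted,"/user/bdeng/Web/docs");
} else {
    # Nothing to search for; remind the user politely.
    print $q->p("Please enter one or more words to search for.");
}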