
Listing 9.6. Converting a relative URL to an absolute URL.

sub getAbsoluteURL {
   my($parent,$current)=@_;
   my($absURL)="";
   my $pURL = new URI::URL $parent;
   # Parse the link with the parent as its base so relative paths can be resolved.
   my $cURL = new URI::URL $current, $pURL;
   my $abs = $cURL->abs();
   # Only http URLs are of interest to the crawler; anything else returns "".
   if (defined($abs->scheme()) && $abs->scheme() eq 'http') {
      $absURL = $abs->as_string();
   }
   return $absURL;
}
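
As a quick illustration (the file name and URLs here are made up), a relative href found in a parent document can be resolved like this:

$abs = &getAbsoluteURL("http://www.example.com/docs/index.html",
                       "chapter1.html");
# $abs now holds "http://www.example.com/docs/chapter1.html"
print "$abs\n";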

The only remaining function besides the main program is writeToLog(). This is a very straightforward function. All you need to do is open the log file and append the title and URL. For simplicity, write each on its own line, which avoids having to parse anything during lookup. All titles will be on odd-numbered lines, and all URLs will be on the even-numbered lines immediately following their titles. If a document has no title, a blank line will appear where the title would have been. Listing 9.7 shows the writeToLog() function.

Listing 9.7. Writing the title and URL to the log file.

sub writeToLog {
    my($logFile,$url,$title)=@_;
    if (open(OUT,">> $logFile")) {
       # Each record is two lines: the title, then the URL.
       print OUT "$title\n";
       print OUT "$url\n";
       close(OUT);
    } else {
       warn("Could not open $logFile for append! $!\n");
    }
}
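
Because every record occupies exactly two lines, a lookup routine only needs to read the file in pairs. The following readFromLog() subroutine is not part of the crawler itself; it is only a sketch of how the index might be read back:

sub readFromLog {
    my($logFile)=@_;
    my(@entries)=();
    if (open(IN,"< $logFile")) {
       # Titles are on odd-numbered lines; the matching URL is on the
       # even-numbered line immediately following.
       while (defined(my $title = <IN>)) {
          my $url = <IN>;
          last unless defined($url);
          chomp($title);
          chomp($url);
          push(@entries, { title => $title, url => $url });
       }
       close(IN);
    } else {
       warn("Could not open $logFile for reading! $!\n");
    }
    return @entries;
}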

Now you can put this all together in the main program. The program will accept multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive calls. Listing 9.8 shows the code for specifying these criteria.

Listing 9.8. Specifying the starting points and stopping points.

#!/public/bin/perl5
use URI::URL;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parse;
use HTML::Element;

my($ua) = new LWP::UserAgent;
# Honor a proxy server if one is configured in the environment.
if (defined($ENV{'HTTP_PROXY'})) {
   $ua->proxy('http',$ENV{'HTTP_PROXY'});
}
$MAX_DEPTH=20;
$CRLF="\n";
$URL_LOG="/usr/httpd/index/crawl.index";
my(@visitedAlready)=();
# Start a crawl from each URL given on the command line.
foreach $url (@ARGV) {
   &crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
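
Assuming the program is saved in a file named crawlit.pl (the filename, proxy host, and starting URLs below are only examples), it can be started from the shell with one or more URLs as arguments:

HTTP_PROXY=http://proxy.example.com:8080/; export HTTP_PROXY
perl crawlit.pl http://www.example.com/ http://www.example.org/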

There is another module, WWW::RobotRules, that makes it easier for you to abide by the Standard for Robot Exclusion. This module parses the robots.txt file published at the top level of a remote site to find out whether robots are allowed to visit. For more information on the Standard for Robot Exclusion, refer to
http://www.webcrawler.com/mak/projects/robots/norobots.html
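
A minimal sketch of how WWW::RobotRules might be used before requesting a page follows; the user-agent name and URLs are placeholders:

use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules = new WWW::RobotRules 'MyCrawler/1.0';
# Fetch and parse the site's robots.txt before crawling the site.
my $robotsURL = "http://www.example.com/robots.txt";
my $robotsTxt = get($robotsURL);
$rules->parse($robotsURL, $robotsTxt) if defined($robotsTxt);

if ($rules->allowed("http://www.example.com/somepage.html")) {
   # The page may be fetched.
} else {
   warn("robots.txt disallows this URL\n");
}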