CGI and Perl

Listing 9.5. The crawlIt() main function of the Web spider.

sub crawlIt {
    my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;
    if ($depth++>$MAX_DEPTH) {
       return;
    }
    $request = new HTTP::Request 'GET', $urlStr;
    $response = $ua->request($request);
    if ($response->is_success) {
       my($urlData)=$response->content();
       my($html) = parse_html($urlData);
       $title="";
       $html->traverse(\&searchForTitle,1);
       &writeToLog($urlLog,$urlStr,$title);
       foreach (@{$html->extract_links(qw(a))}) {
          ($link,$linkelement)=@$_;
          my($url)=&getAbsoluteURL($link,$urlStr);
          if ($url ne "") {
             # quotemeta() escapes every regex metacharacter in the URL,
             # so no hand-written substitutions or string eval are needed
             my($escapedURL)=quotemeta($url);
             if (grep(/$escapedURL/,@$visitedAlready) == 0) {
                push(@$visitedAlready,$url);
                &crawlIt($ua,$url,$urlLog,$visitedAlready,$depth);
             }
          }
       }
    }
}
sub searchForTitle {
    my($node,$startflag,$depth)=@_;
    $lwr_tag=$node->tag;
    $lwr_tag=~tr/A-Z/a-z/;
    if ($lwr_tag eq 'title') {
       foreach (@{$node->content()}) {
          $title .= $_;
       }
       return 0;
    }
    return 1;
}

Note:

In this function, all of the my() qualifiers matter. Because the function calls itself recursively, each call needs its own copy of its working variables; without my(), a nested call would overwrite values that the calling invocation still needs. Note also that errors are silently ignored: when a request fails, the function simply moves on. You could easily add an error handler that notifies the user of any stale links found.
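
The effect of my() in a recursive function can be seen in a small standalone sketch. The two subroutines below, walkShared() and walkPrivate(), and the sample tree are hypothetical and are not part of the spider; they differ only in whether the working variable is declared with my().

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the spider. Each one walks a small
# tree and records the name it holds AFTER its recursive calls return.

our $label;   # package variable shared by every call to walkShared()

sub walkShared {
    my ($node, $seen) = @_;
    $label = $node->{name};          # no my(): nested calls overwrite it
    walkShared($_, $seen) foreach @{ $node->{kids} };
    push @$seen, $label;             # may no longer be this node's name
}

sub walkPrivate {
    my ($node, $seen) = @_;
    my $label = $node->{name};       # my(): a fresh copy for each call
    walkPrivate($_, $seen) foreach @{ $node->{kids} };
    push @$seen, $label;
}

my $tree = { name => 'root', kids => [ { name => 'child', kids => [] } ] };

my (@shared, @private);
walkShared($tree, \@shared);
walkPrivate($tree, \@private);

print "shared:  @shared\n";    # prints "shared:  child child"
print "private: @private\n";   # prints "private: child root"
```

In the shared version, the parent call loses its own name once the nested call has run. The same thing could happen to a variable in crawlIt() that is read after the recursive call returns.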

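One possible shape for such an error handler is sketched below. The helper name staleLinkReport() and the example URL are assumptions made for illustration; the only library calls used are the is_success and status_line methods of HTTP::Response. The responses are built by hand so that the sketch runs without a network connection.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper, not in the original listing: returns a one-line
# report for a failed response, or undef when the request succeeded.
sub staleLinkReport {
    my ($urlStr, $response) = @_;
    return undef if $response->is_success;
    return sprintf("Stale link: %s (%s)", $urlStr, $response->status_line);
}

# Stand-in responses; in the spider these would come from $ua->request().
my $missing = HTTP::Response->new(404, 'Not Found');
my $ok      = HTTP::Response->new(200, 'OK');

my $report = staleLinkReport('http://www.example.com/old.html', $missing);
print STDERR "$report\n" if defined $report;
```

In crawlIt(), a call like this could sit in an else branch of the is_success test, writing the report to the log or to STDERR instead of dropping the failure.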

The other important function you need to write is getAbsoluteURL(). This function takes the current URL string and the parent URL string as arguments. It uses the URI::URL module to determine whether the current URL is already absolute. If so, it returns the current URL as is; otherwise, it constructs a new URL based on the parent URL. You also need to check that the protocol of the URL is HTTP. Listing 9.6 shows how to convert a relative URL to an absolute URL.