sub getAbsoluteURL {
    my($parent,$current) = @_;
    my($absURL) = "";
    my($pURL) = new URI::URL $parent;
    my($cURL) = new URI::URL $current;
    my($scheme) = $cURL->scheme();
    if (!defined($scheme)) {
        # A relative link has no scheme of its own; resolve it
        # against the parent document's URL.
        $absURL = $cURL->abs($pURL);
    } elsif ($scheme eq 'http') {
        if ($cURL->host() eq "") {
            $absURL = $cURL->abs($pURL);
        } else {
            # Already a fully qualified http URL; pass it through.
            $absURL = $current;
        }
    }
    return $absURL;
}
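For example, given a parent page and a relative link pulled from it, the function returns the fully qualified address. A short illustration (the server name and paths here are placeholders):

    $abs = &getAbsoluteURL("http://www.server.com/docs/index.html",
                           "../images/map.html");
    # $abs is now "http://www.server.com/images/map.html"

An absolute http URL such as "http://www.other.com/page.html" would come back unchanged, and a non-http URL (for example, a mailto: link) comes back as the empty string, so the caller can simply skip it.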
The only remaining function besides the main program is writeToLog(). This is a very straightforward function. All you need to do is open the log file and append the title and the URL. For simplicity, write each on its own line, which avoids having to parse anything during lookup. All titles will be on odd-numbered lines and all URLs on the even-numbered lines immediately following the title. If a document has no title, a blank line will appear where the title would have been. Listing 9.7 shows the writeToLog() function.
Listing 9.7. Writing the title and URL to the log file.
sub writeToLog {
    my($logFile,$url,$title) = @_;
    if (open(OUT,">> $logFile")) {
        print OUT "$title\n";
        print OUT "$url\n";
        close(OUT);
    } else {
        warn("Could not open $logFile for append! $!\n");
    }
}
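Because titles and URLs alternate line by line, reading the index back is just a matter of consuming the file two lines at a time. A minimal sketch (this loop assumes the log format described above and omits error checking on short files):

    open(IN, $URL_LOG) || die "Cannot open $URL_LOG: $!\n";
    while (defined($title = <IN>)) {
        $url = <IN>;           # the URL always follows its title
        chomp($title);
        chomp($url);
        print "$url ($title)\n";
    }
    close(IN);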
Now you can put this all together in the main program. The program will accept multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive calls. Listing 9.8 shows the code for specifying these criteria.
Listing 9.8. Specifying the starting points and stopping points.
#!/public/bin/perl5

use URI::URL;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parse;
use HTML::Element;

my($ua) = new LWP::UserAgent;
if (defined($ENV{'HTTP_PROXY'})) {
    $ua->proxy('http',$ENV{'HTTP_PROXY'});
}

$MAX_DEPTH = 20;
$CRLF = "\n";
$URL_LOG = "/usr/httpd/index/crawl.index";
my(@visitedAlready) = ();
foreach $url (@ARGV) {
    &crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
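Assuming the program is saved as crawler.pl (the name is arbitrary), you would start a crawl from one or more pages like this:

    ./crawler.pl http://www.server.com/ http://www.other.com/docs/

If the HTTP_PROXY environment variable is set, the requests go through that proxy automatically; otherwise the user agent connects directly.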
There is another module available, WWW::RobotRules, that will make it easier for you to abide by the Standard for Robot Exclusion. This module parses the robots.txt file at the top level of the remote site to find out whether robots are allowed there. For more information on the Standard for Robot Exclusion, refer to
http://www.webcrawler.com/mak/projects/robots/norobots.html
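A sketch of how the module fits into a crawler like this one (the user-agent name and URLs are placeholders, and error handling is omitted):

    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my($rules) = new WWW::RobotRules 'MyCrawler/1.0';
    my($robotsURL) = "http://www.server.com/robots.txt";
    my($robotsTxt) = get($robotsURL);
    # Remember this site's rules; parse() may be called once per site.
    $rules->parse($robotsURL, $robotsTxt) if defined($robotsTxt);
    if ($rules->allowed("http://www.server.com/docs/index.html")) {
        # The rules permit this URL, so it is safe to request it.
    }

A call to $rules->allowed() before each request inside crawlIt() would keep the crawler out of any area the site's administrator has placed off limits.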