Mirroring can be accomplished by starting at the home page of a server and recursively traversing through all of its local links to determine the files that need to be copied. Using this approach and much of the code in the previous examples, you can fairly easily automate the process of mirroring a Web site.
We will assume that any link given as a relative URL rather than an absolute one is local and thus needs to be mirrored. All absolute URLs will be treated as documents owned by other servers, which we can ignore. This means that the following types of links will be ignored:
<A HREF=http://www.netscape.com>
<A HREF=ftp://ftp.netscape.com/Software/ns201b2.exe>
<A HREF=http://www.apple.com/cgi-bin/doit.pl>
However, these links will be considered local and will be mirrored:
<A HREF=images/home.gif>
<A HREF=pdfs/layout.pdf>
<A HREF=information.html>
<IMG SRC=images/animage.gif>
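The relative-versus-absolute rule can be sketched with the URI module, which reports an undefined scheme for a relative reference. The helper name isLocalLink() is hypothetical and is introduced here only for illustration; treat this as a minimal sketch of the rule rather than one of the chapter's listings.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of the chapter's listings): returns 1 when
# the link is a relative reference with no scheme of its own, 0 otherwise.
sub isLocalLink {
   my($link)=@_;
   my $uri = URI->new($link);
   return defined($uri->scheme()) ? 0 : 1;
}

# Classify a few of the sample links from the text.
foreach my $link ("http://www.netscape.com",
                  "ftp://ftp.netscape.com/Software/ns201b2.exe",
                  "images/home.gif",
                  "information.html") {
   printf "%-45s %s\n", $link, isLocalLink($link) ? "mirror" : "ignore";
}
```

Note that under this rule an absolute URL that points back to the server being mirrored is also ignored, which follows from the assumption stated above.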
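The LWP::UserAgent module contains a method called mirror(), which gets a Web document from a server and stores it in a local file. If the file already exists, its modification time is sent in an If-Modified-Since header, so the document is fetched again only if it has changed on the server. The content length is used to check that the transfer was complete.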
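A minimal sketch of calling mirror() follows. The wrapper name mirrorOne() and the example.com URL in the usage comment are placeholders introduced for illustration. The sketch relies on mirror() returning a response object whose status code is 304 (Not Modified) when the local copy is already current.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical wrapper around mirror(); returns a short status string.
sub mirrorOne {
   my($ua,$url,$file)=@_;
   my $response = $ua->mirror($url,$file);
   return "unchanged" if $response->code() == 304;   # local copy is current
   return "fetched"   if $response->is_success();    # file was (re)written
   return "failed: " . $response->status_line();
}

# Usage, assuming a placeholder URL and local file name:
#   my $ua = LWP::UserAgent->new;
#   print mirrorOne($ua, "http://www.example.com/index.html", "index.html"), "\n";
```

Checking for 304 before is_success() matters because a 304 response is not counted as a success, even though it means nothing needed to be copied.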
The changes you would need to make to the sample above are fairly minimal. For example, getAbsoluteURL() would be changed to return an absolute URL only for URLs local to the server you are mirroring, as shown in Listing 9.9.
Listing 9.9. Modified function to convert relative URLs to absolute URLs.
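sub getAbsoluteURL {
   my($parent,$current)=@_;
   my($absURL)="";
   my $pURL = new URI::URL $parent;
   my $cURL = new URI::URL $current;
   my $scheme = $cURL->scheme();
   # A relative URL has no scheme of its own, so treat it as local.
   my $local = !defined($scheme);
   # Also accept an http URL that names no host.
   if (!$local && $scheme eq 'http') {
      my $host = $cURL->host();
      $local = (!defined($host) || $host eq "");
   }
   # Only local URLs are made absolute; all others return "".
   $absURL=$cURL->abs($pURL) if $local;
   return $absURL;
}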
The other change would be in crawlIt(), shown earlier in Listing 9.5. Instead of writing the URL and title to the log, call a subroutine named mirrorFile(), as in Listing 9.10, which uses the LWP::UserAgent mirror() method. You should also search for other file references, such as the SRC attribute of the <IMG> tag.