CGI and Perl

Mirroring Remote Sites

One task a Webmaster might want to automate is mirroring a site across multiple servers. Mirroring means copying all of the files associated with a Web site and making them available at another Web site. One reason to do this is to avoid major downtime when the primary server suffers a hardware or software failure. Another is to provide identical sites in different parts of the world, so that a person in Beijing can use a machine in Hong Kong that mirrors the New York site instead of reaching all the way to a machine in New York.

Mirroring can be accomplished by starting at the home page of a server and recursively traversing through all of its local links to determine the files that need to be copied. Using this approach and much of the code in the previous examples, you can fairly easily automate the process of mirroring a Web site.
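The traversal described above can be sketched without any network access. In this minimal example the %links_in table is a hypothetical stand-in for the links you would extract from each fetched page, and files_to_mirror() is an illustrative name, not a function from the earlier listings. It uses a queue instead of recursion, which reaches the same set of files, and a %seen hash so that pages linking back to each other are visited only once.

```perl
use strict;
use warnings;

# Hypothetical link table standing in for fetched pages: each key is a
# local file, each value lists the local files that page references.
my %links_in = (
    'index.html'       => ['information.html', 'images/home.gif'],
    'information.html' => ['index.html', 'pdfs/layout.pdf'],
);

# Walk the local links from $start and return every file reached, in the
# order first seen. %seen stops the walk from revisiting a file.
sub files_to_mirror {
    my ($start, $links) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @found;
    while (defined(my $file = shift @queue)) {
        push @found, $file;
        for my $next (@{ $links->{$file} || [] }) {
            next if $seen{$next}++;    # already queued or visited
            push @queue, $next;
        }
    }
    return @found;
}

print "$_\n" for files_to_mirror('index.html', \%links_in);
```

In a real mirror, the lookup in %links_in would be replaced by fetching the page and extracting its links, as in the earlier examples.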

We will assume that any link written as a relative URL rather than an absolute one is local and therefore needs to be mirrored. All absolute URLs will be treated as documents owned by other servers, which we can ignore. This means that the following types of links will be ignored:

<A HREF=http://www.netscape.com>
<A HREF=ftp://ftp.netscape.com/Software/ns201b2.exe>
<A HREF=http://www.apple.com/cgi-bin/doit.pl>

However, these links will be considered local and will be mirrored:

<A HREF=images/home.gif>
<A HREF=pdfs/layout.pdf>
<A HREF=information.html>
<IMG SRC=images/animage.gif>
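As a rough illustration of this rule, the following sketch classifies a link by looking only at the text of the URL. The name is_local_link() is hypothetical, and the check is deliberately simple: anything that starts with a scheme (such as http: or ftp:) or with // followed by a host name is treated as belonging to another server.

```perl
use strict;
use warnings;

# Returns true when $href is a relative URL (no scheme and no host),
# which this chapter treats as local to the site being mirrored.
sub is_local_link {
    my ($href) = @_;
    return 0 if $href =~ m{^[A-Za-z][A-Za-z0-9+.\-]*:};   # has a scheme, e.g. http: or ftp:
    return 0 if $href =~ m{^//};                           # no scheme, but names a host
    return 1;
}

for my $href ('http://www.netscape.com',
              'ftp://ftp.netscape.com/Software/ns201b2.exe',
              'images/home.gif',
              'information.html') {
    printf "%-48s %s\n", $href, is_local_link($href) ? 'mirror' : 'ignore';
}
```

Note that under this assumption an absolute URL that happens to point back at your own server is also ignored; a site that links to itself with absolute URLs would need a host comparison instead.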

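The LWP::UserAgent module provides a method called mirror(), which fetches a Web document and stores it in a local file. If the file already exists, mirror() sends its modification time in an If-Modified-Since header so that an unchanged document is not downloaded again, and it compares the Content-Length reported by the server with the size of the downloaded file to detect a truncated transfer.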
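A minimal sketch of calling mirror() might look like the following. The helper name mirror_once() and its three result strings are assumptions made for illustration; only mirror(), code(), is_success(), and status_line() come from LWP itself. A 304 (Not Modified) response means the local copy is already current.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: mirror one URL into a local file and report
# what happened as a short string.
sub mirror_once {
    my ($ua, $url, $file) = @_;
    my $response = $ua->mirror($url, $file);
    return 'unchanged' if $response->code == 304;   # Not Modified: local copy is current
    return 'fetched'   if $response->is_success;    # a fresh copy was written to $file
    return 'failed: ' . $response->status_line;
}

# Usage: perl mirror_once.pl URL LOCALFILE
if (@ARGV == 2) {
    print mirror_once(LWP::UserAgent->new, @ARGV), "\n";
}
```

Running the script twice against the same unchanged document should report a fetch the first time and no change the second time, provided the server supplies a Last-Modified date.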

The changes you would need to make to the sample above are fairly minimal. For example, getAbsoluteURL() would be changed to return an absolute URL only for URLs local to the server you are mirroring, as shown in Listing 9.9.

Listing 9.9. Modified function to convert relative URLs to absolute URLs.

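use URI::URL;

sub getAbsoluteURL {
    my ($parent, $current) = @_;
    my $absURL = "";
    my $pURL = URI::URL->new($parent);
    my $cURL = URI::URL->new($current);
    # A relative URL has no scheme and no host (authority) of its own;
    # only such URLs are local, so only they are made absolute.
    if (!defined $cURL->scheme && !defined $cURL->authority) {
        $absURL = $cURL->abs($pURL)->as_string;
    }
    return $absURL;
}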

The other change would be in crawlIt(), shown earlier in Listing 9.5. Instead of writing the URL and title to the log, follow Listing 9.10 to call a subroutine called mirrorFile(), which uses the LWP::UserAgent mirror() method. You should also search for other kinds of file references, such as the images named in <IMG> tags.