CGI and Perl

Listing 14.3. relativize.

#!/usr/bin/perl
 # relativize - parse html documents and change complete urls
 # to relative or partial urls, for the sake of brevity, and to assure
 # that connections to mirror sites grab the mirror's copy of the files
 #
 # Usage: relativize hostname file newfile basepath
 # hostname is the local host
 # file is the html file you wish to parse
 # newfile is the new file to create from the original
 # basepath is the full path to the file, from the http root
 #
 # Example:
 # relativize www.metronet.com perl5.html newperl5.html /perlinfo/perl5
 #
 # Note: does not attempt to do parent-relative directory substitutions
 use HTML::TreeBuilder;
 use URI::URL;
 require 5.002;
 use strict;
 my($h,$usage,$localhost,$filename,$newfile,$base_path);
 $usage ="usage: $0 hostname htmlfile newhtmlfile BasePath\n";
 $localhost= shift;
 $filename = shift;
 $newfile= shift;
 $base_path = shift;
 die $usage unless defined($localhost) and defined($filename)
      and defined($base_path) and defined($newfile);
 $h = HTML::TreeBuilder->new;
 $h->parse_file($filename);
 open(NEW,">$newfile") or die "$0: can't open $newfile: $!\n";
 $h->traverse(\&relativize_urls);
 close(NEW);
 sub relativize_urls {
     my($e, $start,$depth) = @_;
     # if we've got an element
     if(ref $e){
         my $tag = $e->tag;
         if($start){
             # if the tag is an "A" tag
             if($tag eq "a" and defined $e->attr("href")){
                 my $url = URI::URL->new( $e->attr("href") );
                 # if the scheme of the url is http
                 if(defined $url->scheme and $url->scheme eq "http"){
                     # if the host is the local host, modify the
                     # href attribute to have the relative url only.
                     if($url->host eq $localhost){
                         # if the path is relative to the base path
                         # of this file (specified on command line)
                         my $path = $url->path;
                         if($path =~ s/^\Q$base_path\E\/?//){
                             # a filetest could be added here for assurance
                             $e->attr("href",$path);
                         }
                     }
                 }
             }
             print NEW  $e->starttag;
         }
         elsif((not ($HTML::Element::emptyElement{$tag} or
                 $HTML::Element::optionalEndTag{$tag}))){
             print NEW $e->endtag,"\n";
         }
     # else text stuff, just print it out
     } else {
         print NEW $e;
     }
 }

In the relativize_urls() subroutine, I've borrowed the algorithm from the HTML::Element module's as_HTML() method to print everything from the HTML file by default. A reference to the relativize_urls() subroutine is passed to the traverse() method, which the HTML::TreeBuilder class inherits from the HTML::Element class. The complete URLs that refer to local files are changed only after verifying that the host is the local host and that the path component begins with the specified base path. The output goes to the new HTML file specified on the command line.

There are plenty of other uses for the HTML::TreeBuilder class and its parent classes, HTML::Element and HTML::Parser. See the POD documentation for the libwww modules for more details.

Moving an Entire Archive

Copying an external archive may give rise to the need to change the file or directory names associated with the external site, and then to correct the URLs in the HTML files. There may be several reasons for this: the copying site may wish to use a different layout for the archive; or, as mentioned previously, it may be using a DOS file system or following an ISO9660 naming policy, which requires changing any file or directory names that aren't compliant. Placing an archive's contents on a CD-ROM may also require renaming or reorganizing the original structure. Whatever the reason, this task can be quite intimidating.

The algorithm itself involves six steps and three complete passes over the archive, using File::Find or something similar, in order to get everything right. Let's consider the case where you need to migrate an archive from a UNIX file system, which allows long filenames, to a DOS file system, which doesn't. I'm not reproducing full scripts here; I'm simply outlining the algorithm, with a brief sketch of each step, based on a former consulting job where I performed this service for a client.

Step 1: List Directories

The first pass over the archive creates a listing file of all the directories within the archive, in full path form. Each entry in the list has three components: first, the directory's original full name; second, the directory's original name prefixed with the new names of any parent directories (that is, the path as it will exist once the parents have been renamed); and third, the new name, shortened to eight alphanumeric characters and checked for collisions against every other name in the same directory, again prefixed with all of the parent directories' new names. A name that is already short enough can simply be carried over.
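
Here's a rough sketch of what this first pass might look like. The dirmap.txt output file, the shorten() helper, and the exact shortening rule are illustrative assumptions, not a prescribed format.

 # Sketch only: walk the archive and write the three-column directory list.
 use File::Find;
 use strict;

 my $archive = shift or die "usage: $0 archivedir\n";
 my %taken;      # new names already used, keyed by (new) parent directory
 my %newname;    # original directory path => new (shortened) path
 open(MAP, ">dirmap.txt") or die "can't open dirmap.txt: $!\n";
 find(sub {
     return unless -d;
     return if $File::Find::name eq $archive;   # leave the top directory alone
     my $orig      = $File::Find::name;         # original full path
     my $newparent = $newname{$File::Find::dir} || $File::Find::dir;
     my $short     = shorten($_, $newparent);   # 8 chars, collision-free
     $newname{$orig} = "$newparent/$short";
     # columns: original path, original name under renamed parents, new path
     print MAP join("\t", $orig, "$newparent/$_", $newname{$orig}), "\n";
 }, $archive);
 close(MAP);

 sub shorten {
     my($name, $parent) = @_;
     (my $short = $name) =~ s/[^A-Za-z0-9]//g;  # keep alphanumerics only
     $short = substr($short, 0, 8) || "dir";
     my($try, $n) = ($short, 1);
     while ($taken{"$parent/\L$try"}++) {       # collision: append a number
         $try = substr($short, 0, 8 - length($n)) . $n;
         $n++;
     }
     return $try;
 }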

Step 2: Rename Directories

The directories are renamed during this step, based on the list created during pass one. The list has to be sorted hierarchically, from the top level down to the lowest level, for the renaming operations to work. The first argument to rename() is the original name of the current directory with its parent directories' new names as a full path; the second argument is the new short name, again with the renamed parents in the path. These are the second and third components of each entry in the list created during pass one.
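
Continuing the sketch, and assuming the tab-separated dirmap.txt file written above, the renaming pass might look like this:

 # Sketch only: rename directories top-down, using columns two and three
 # of the dirmap.txt list written by the previous fragment.
 use strict;

 open(MAP, "dirmap.txt") or die "can't open dirmap.txt: $!\n";
 my @entries = map { chomp; [ split /\t/ ] } <MAP>;
 close(MAP);

 # Sort from the top level down, so parents are renamed before their children.
 for my $entry (sort { ($a->[1] =~ tr!/!!) <=> ($b->[1] =~ tr!/!!) } @entries) {
     my($orig, $current, $new) = @$entry;
     next if $current eq $new;      # the name was already short enough
     rename($current, $new) or warn "couldn't rename $current to $new: $!\n";
 }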

Step 3: List Files

The third step makes another pass over the archive, creating a second list. Each entry holds the current (possibly renamed) directory and the original filename of a file, as a full path, followed by the current directory and the new filename. The new filename is shortened to the 8.3 format, again with verification that it doesn't collide with any other name in the same directory.
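
A sketch of this pass, under the same assumptions (filemap.txt and the shorten83() helper are again illustrative names):

 # Sketch only: list every file with its current name and a new 8.3 name.
 use File::Find;
 use strict;

 my $archive = shift or die "usage: $0 archivedir\n";
 my %taken;
 open(MAP, ">filemap.txt") or die "can't open filemap.txt: $!\n";
 find(sub {
     return unless -f;
     my $dir = $File::Find::dir;    # the directory has already been renamed
     my $new = shorten83($_, $dir);
     print MAP join("\t", "$dir/$_", "$dir/$new"), "\n";
 }, $archive);
 close(MAP);

 sub shorten83 {
     my($name, $dir) = @_;
     my($base, $ext) = $name =~ /^(.*?)(?:\.([^.]*))?$/;
     for ($base, $ext) { $_ = "" unless defined; s/[^A-Za-z0-9]//g; }
     $base = substr($base, 0, 8) || "file";
     $ext  = substr($ext, 0, 3);
     my($try, $n) = ($ext ne "" ? "$base.$ext" : $base, 1);
     while ($taken{"$dir/\L$try"}++) {          # collision: append a number
         my $stem = substr($base, 0, 8 - length($n)) . $n;
         $try = $ext ne "" ? "$stem.$ext" : $stem;
         $n++;
     }
     return $try;
 }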

Step 4: Rename Files

The fourth step renames the files, based on the list created in pass three.
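
The renaming pass itself is short:

 # Sketch only: rename files according to the filemap.txt list.
 use strict;

 open(MAP, "filemap.txt") or die "can't open filemap.txt: $!\n";
 while (<MAP>) {
     chomp;
     my($current, $new) = split /\t/;
     next if $current eq $new;
     rename($current, $new) or warn "couldn't rename $current to $new: $!\n";
 }
 close(MAP);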

Step 5: Create an HTML Fixup List

The fifth step takes both of the lists created previously and produces one final list, giving the original name of each file or directory, followed by its current name. Again, both should be specified as full paths. This list is then used to correct any anchors or links in your HTML files that have been affected by this massive change.
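
One way to merge the two lists into a single original-to-current map (urlmap.txt is, again, only an illustrative name):

 # Sketch only: combine dirmap.txt and filemap.txt into one
 # original-path => current-path list.
 use strict;

 my(%dir_orig, %map);
 open(DIRS, "dirmap.txt") or die "can't open dirmap.txt: $!\n";
 while (<DIRS>) {
     chomp;
     my($orig, $mixed, $new) = split /\t/;
     $map{$orig}     = $new;
     $dir_orig{$new} = $orig;       # map a renamed directory back to its original
 }
 close(DIRS);

 open(FILES, "filemap.txt") or die "can't open filemap.txt: $!\n";
 while (<FILES>) {
     chomp;
     my($current, $new) = split /\t/;
     my($dir, $name) = $current =~ m!^(.*)/([^/]+)$!;
     my $origdir = $dir_orig{$dir} || $dir;
     $map{"$origdir/$name"} = $new; # the file's original full path
 }
 close(FILES);

 open(OUT, ">urlmap.txt") or die "can't open urlmap.txt: $!\n";
 print OUT "$_\t$map{$_}\n" for sort keys %map;
 close(OUT);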

Step 6: Fix the HTML Files

The final step reads in the list created in Step 5 and opens each HTML file to fix the internal links that still use the original names and paths. It refers to the list from Step 5 to decide whether to change a given URL during the parsing process, and then overwrites the current HTML file. Line-termination characters should also be converted to the appropriate ones for the new architecture at this time.
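
This last pass can be patterned directly after Listing 14.3. The following sketch rewrites the href attributes of one HTML file from the urlmap.txt list; handling src attributes, relative URLs, and line-termination conversion is left as an exercise:

 # Sketch only: fix the links in one HTML file using the Step 5 list.
 use HTML::TreeBuilder;
 use strict;

 my %map;
 open(MAP, "urlmap.txt") or die "can't open urlmap.txt: $!\n";
 while (<MAP>) { chomp; my($old, $new) = split /\t/; $map{$old} = $new; }
 close(MAP);

 my $file = shift or die "usage: $0 htmlfile\n";
 my $h = HTML::TreeBuilder->new;
 $h->parse_file($file);
 open(NEW, ">$file.new") or die "can't open $file.new: $!\n";
 $h->traverse(sub {
     my($e, $start) = @_;
     if (ref $e) {
         my $tag = $e->tag;
         if ($start) {
             if ($tag eq "a" and defined $e->attr("href")) {
                 my $href = $e->attr("href");
                 $e->attr("href", $map{$href}) if defined $map{$href};
             }
             print NEW $e->starttag;
         }
         elsif (not ($HTML::Element::emptyElement{$tag} or
                     $HTML::Element::optionalEndTag{$tag})) {
             print NEW $e->endtag, "\n";
         }
     } else {
         print NEW $e;
     }
 });
 close(NEW);
 rename("$file.new", $file) or die "couldn't replace $file: $!\n";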

It's a rather complicated process, to be sure. Of course, if you design your archive from the original planning stages to account for the possibility of this sort of task (by using ISO9660 names), you'll never have to suffer the pain and time consumption of this process.

Verification of HTML Elements

The process of verifying the links that point to local documents within your HTML should be performed on a regular basis. Occasionally, and especially if you're not using a form of revision control as discussed previously, you may make a change to the structure of your archive that renders a link useless until it is updated to reflect the new name or location of the resource to which it points.

Broken links are also a problem you will confront when you're using links to external sites' HTML pages or to other remote resources. The external site may change its layout or structure or, more drastically, its hostname, due to a move or other issues. In these cases, you might be notified before the change or shortly thereafter, provided the remote site is aware that you're linking to its resources. (This is one reason to notify an external site when you create links to its resources.) Then, at the appropriate time, you'll be able to update the local HTML files that include those links.

Several scripts and tools are available that implement this task for you. Tom Christiansen has written a simple one called churl. It does limited verification of the URLs in an HTML file retrieved from a server, checking the existence and retrievability of HTTP, FTP, and file URLs. It could be modified to suit your needs and, optionally, to verify relative (local) or partial URLs as well; a quick sketch of this kind of check appears below. It's available on CPAN in his authors directory:

~/authors/id/TOMC/scripts.

He has also created a number of other useful scripts and tools for use in Web maintenance and security, which also can be retrieved from his directory at any CPAN site.
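
If you'd like to roll your own quick check along the same lines, something like the following will report unreachable absolute URLs in a file. This is not Tom Christiansen's code; the HTML::LinkExtor module and LWP::Simple's head() function are simply one way to do it.

 # Sketch only: report absolute URLs in an HTML file that can't be fetched.
 use HTML::LinkExtor;
 use LWP::Simple qw(head);
 use strict;

 my $file = shift or die "usage: $0 htmlfile\n";
 my @urls;
 my $p = HTML::LinkExtor->new(sub {
     my($tag, %attrs) = @_;
     push @urls, values %attrs;     # href, src, and friends
 });
 $p->parse_file($file);

 for my $url (@urls) {
     next unless $url =~ /^(http|ftp|file):/;   # skip relative URLs here
     print "BROKEN: $url\n" unless head($url);
 }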

The other tool we'll mention here, called weblint, is written by Neil Bowers and is probably the most comprehensive package available for verification of HTML files. In addition to checking for the existence of local anchor targets, it also thoroughly checks the other elements in your HTML file.

The weblint tool is available at any CPAN archive, under Neil Bowers's authors directory:

~/authors/id/NEILB/weblint-*.tar.gz.

It's widely used and highly recommended. Combining this tool with something such as Tom Christiansen's churl script will give you a complete verification package for your HTML files. See the README file with weblint for a complete description of all the features.