use URI::URL;
use HTML::TreeBuilder;
use LWP::Simple qw(get);

my($h,$link,$base,$url);
$base = "http://www.best.com/";
$h = HTML::TreeBuilder->new;
$h->parse(get($base));
foreach $pair (@{$h->extract_links(qw<a img>)}) {
    my($link,$elem) = @$pair;
    $url = url($link,$base);
    print $url->abs,"\n";
}
Running Listing 14.2 with the libwww modules properly installed produces the following output for the URL http://www.best.com/index.html (my current ISP), showing all of the A and IMG links within that page:
http://webx.best.com/cgi-bin/imagemap/mainpl.map
http://www.best.com/images/mainpnl3.gif
http://www.best.com/about.html
http://www.best.com/images/persoff.gif
http://www.best.com/corp.html
http://www.best.com/images/corpserv.gif
http://www.best.com/policy.html
http://www.best.com/images/ourpol.gif
http://www.best.com/support.html
http://www.best.com/images/faq.gif
http://www.best.com/prices.html
http://www.best.com/images/pricepol.gif
http://www.best.com/pop.html
http://www.best.com/images/lan.gif
http://www.best.com/corpppp.html
http://www.best.com/images/webpd.gif
http://www.best.com/client.html
http://www.best.com/images/hosted.gif
http://www.best.com/images/announce.gif
http://www.onlive.com/
http://crystal.onlive.com/beta/index.htm
http://www.best.com/best_resort/entrance.sds
mailto:info@best.com
http://www.best.com/images/best4.gif
mailto:www@best.com
Note that this listing may vary, depending on your location and whether there have been changes to index.html since this chapter was written.

Now that you've seen how to use the LWP modules to do some very simple parsing, let's take a look at how to use them for some useful tasks.
Editing and Verifying HTML
You can use Perl in a number of ways to edit, verify, and validate HTML. There are modules that handle the parsing and substitutions, as well as several complete tools that check the syntax of the HTML and the validity of its anchors to other locations and documents. The following examples demonstrate how to use these tools to perform tasks that may confront you as a Webmaster from time to time.
Converting from Absolute to Relative URLs
Suppose that at some point, while still coming up to speed on the HTML specifications, the Webmaster creates a document that uses the complete form of the URL in all links, giving the scheme, host, and path. Later, as understanding grows, the Webmaster wishes to go back and change all the links in the HTML documents that refer to local resources so that they use the relative form. This way, if another site mirrors the Webmaster's site, requests for local documents from the mirror copy will be served from the mirror site instead of the master site.
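To see why the relative form helps a mirror, consider how a browser resolves each kind of link against the URL of the page that contains it. The following is only a minimal sketch, assuming the URI module is installed; the helper name resolve_link and the host mirror.example.org are made up for illustration and do not come from the listings in this chapter.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve a link the way a browser would, against the
# URL of the page that contains it.
sub resolve_link {
    my ($link, $page_url) = @_;
    return URI->new_abs($link, $page_url)->as_string;
}

# mirror.example.org is a made-up mirror host used only for illustration.
my $mirror_page = "http://mirror.example.org/index.html";

# A relative link is resolved against the mirror's own URL.
print resolve_link("about.html", $mirror_page), "\n";

# An absolute link still points at the master site.
print resolve_link("http://www.best.com/about.html", $mirror_page), "\n";
```

The relative link resolves to a document on the mirror, while the absolute link sends the reader back to www.best.com regardless of where the page was fetched from.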
To accomplish this task, you'll start with the general URL-parsing script shown in Listing 14.2. Then you'll add the capability (see Listing 14.3) to print out the new HTML file with the links changed to relative form when they refer to local resources.
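The conversion step itself can be sketched separately from the HTML rewriting. The example below is not Listing 14.3; it is a hedged illustration that assumes the URI module is installed and uses its rel method, with a hypothetical helper named to_relative. The sample links are taken from the output shown earlier.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not from Listing 14.3): return the relative form of
# $link when it shares the scheme and host of $base; otherwise return the
# link in absolute form.
sub to_relative {
    my ($link, $base) = @_;
    my $abs = URI->new_abs($link, $base);   # resolve against the page's URL
    return $abs->rel($base)->as_string;     # rel() leaves other sites unchanged
}

my $base = "http://www.best.com/index.html";
for my $link (qw(
    http://www.best.com/about.html
    http://www.best.com/images/faq.gif
    http://www.onlive.com/
    mailto:info@best.com
)) {
    print to_relative($link, $base), "\n";
}
```

Links on www.best.com come back as about.html and images/faq.gif, while the link to www.onlive.com and the mailto: address are left as they were, since they do not refer to local resources.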