Perl 2003 Advent Calendar: LWP::Simple

Simple modules are, well, simple. Let's download a web page using the get method:

  #!/usr/bin/perl

  # turn on perl's safety features
  use strict;
  use warnings;

  # define all the handy functions 
  use LWP::Simple;

  my $document = get("http://news.bbc.co.uk/")
    or die "Couldn't download the BBC news";

  if ($document =~ /terror/i)
   { print "Don't panic Mr Mannering! Don't Panic!" }

The get method returns the content on success, and undef on failure. It's that simple. It doesn't do multiplexing. It doesn't do setting up of complex user agents. It doesn't even do proper error handling (it either got the document or it didn't) but it just works. It's great.

One of the things I do when I write code is to start with LWP::Simple, importing the get method into my code, so that at a later date I can replace it with custom code that does exactly what I want. For example, I start off with something like this:

  use LWP::Simple qw(get);

And then later I'll remove that line completely and go to the lengths of setting up my own user agent, writing better error checking, etc, etc.

Even if you don't want to go to the lengths described here, it's often considered good manners to modify your agent so that it properly identifies itself. This is useful in case you make a mistake in your program and, unbeknownst to you, leave it running to wreak havoc on some unsuspecting site.

LWP::Simple lets you access its LWP::UserAgent object, the object that actually represents the abstract web browser, by asking for $ua to be exported into your namespace:

  use LWP::Simple qw(get $ua);

You can then call methods directly on that object to set useful identification details:

  # set a useful id string
  $ua->agent(q{Mark's Agent (mark@twoshortplanks.com)});

  # explicitly set the From header so people know where to
  # complain if they're getting hit by the agent too often
  $ua->from('mark@twoshortplanks.com');

You can also set other options, like the time a request should run for before timing out:

  # time out after a minute
  $ua->timeout(60);
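
Putting these pieces together, the "later" version mentioned earlier, with its own user agent and better error checking, might look something like the sketch below. It uses LWP::UserAgent directly rather than LWP::Simple. The fetch helper is a name I've made up for illustration, and this is only one reasonable way to lay it out:

  #!/usr/bin/perl

  # turn on perl's safety features
  use strict;
  use warnings;

  use LWP::UserAgent;

  # our own agent, identified in the same way as above
  my $ua = LWP::UserAgent->new(
    agent   => q{Mark's Agent (mark@twoshortplanks.com)},
    from    => 'mark@twoshortplanks.com',
    timeout => 60,
  );

  # fetch (a made-up helper) returns the content of the page,
  # or dies saying what actually went wrong
  sub fetch {
    my ($url) = @_;
    my $response = $ua->get($url);
    die "Couldn't download $url: ", $response->status_line, "\n"
      unless $response->is_success;
    return $response->decoded_content;
  }

Calling fetch("http://news.bbc.co.uk/") then either hands back the page or dies with the status line the server (or LWP itself) reported, which is rather more to go on than a bare undef.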

Keeping Local Files Up To Date

These days, it's fairly common for frequently updated websites to have their content syndicated via RSS feeds: files in a simple XML RDF format that contain some of the content of the site, and that people can download every so often (say, every hour or so).

What we need to do is write a script that downloads that file and stores a local copy on the disk of my laptop. My first attempt makes use of LWP::Simple's get method:

  #!/usr/bin/perl

  # turn on perl's safety features
  use strict;
  use warnings;

  # get the data
  use LWP::Simple;
  my $rss = get("http://perladvent.org/perladvent.rdf")
    or die "Mark probably broke the server again!\n";

  # write the data out to file
  use IO::File;
  my $fh = IO::File->new(">perladvent.rdf")
    or die "Can't open perladvent.rdf for writing: $!";
  print $fh $rss;

The action of getting a file and saving it to disk is actually so common that there's a shortcut for it, the getstore routine:

  #!/usr/bin/perl
                                                        
  # turn on perl's safety features
  use strict;
  use warnings;

  my $url  = "http://perladvent.org/perladvent.rdf";
  my $file = "perladvent.rdf";
 
  # get the data
  use LWP::Simple;
  is_success(getstore($url, $file))
    or die "Mark probably broke the server again!\n";

The getstore method writes its output directly to the file and returns an HTTP response code. We're all familiar with the code 404, which means a page isn't found, or 500, which means the CGI didn't work, but there's a whole range of other codes indicating various things. The easiest way to check if a code represents something good or bad is to use the is_success and is_error functions: is_success returns true for the 2xx codes, and is_error returns true for the 4xx and 5xx codes.
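
To get a feel for how these checks sort the codes, here's a small sketch that classifies a handful of them without making any request at all. It loads the helpers straight from HTTP::Status, which as far as I know is where LWP::Simple gets them from. The list of codes is just my own choice of examples:

  #!/usr/bin/perl

  # turn on perl's safety features
  use strict;
  use warnings;

  # load the status helpers directly, so no download is needed
  use HTTP::Status qw(is_success is_redirect is_error status_message);

  # print each code, its standard message, and how it's classified
  foreach my $code (200, 304, 404, 500) {
    my $kind = is_success($code)  ? "success"
             : is_redirect($code) ? "redirect"
             : is_error($code)    ? "error"
             :                      "other";
    printf "%d %-22s %s\n", $code, status_message($code), $kind;
  }

Note that 304 (Not Modified) counts as neither a success nor an error, so code that only checks is_success will treat "nothing has changed" as a failure.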

RSS Feeds and Bandwidth

When RSS started out the content was very minimal, containing only the titles of pages, the pages' URLs and, if you were very lucky, a small paragraph summarising each page. However, more and more these days people are putting the majority, if not all, of their content into these RSS feeds as an alternative delivery system to HTML.

While this is a very nice thing to do, as it allows people with RSS aggregators like NetNewsWire or Timesink to read the meaningful content from the site without ever having to visit a real webpage, it does incur huge bandwidth costs for popular sites, especially as clients tend to suck down the entire contents of the site every hour (or more often if they're not paying attention to the standards) even if no new content has been posted. This is very bad for all concerned, but most of all for people running a site who pay by the kilobyte. (For the record, my bandwidth is unmetered, so I don't care about people doing it to me, just about accidentally doing it to other people.)

What we really need to do is only get the content if the page has been updated.

For example, The Perl Advent Calendar's main RSS feed is actually a static file that's created by a Perl script at midnight, or whenever I update the script in between. As I write this, its timestamp reflects that I last made a change to the order of the modules to be published at about quarter past eleven last night:

  mark@gan:/virtual/perladvent.org/www/html$ ls -l perladvent.rdf 
  -rw-r--r--    1 mark     mark         1062 Dec 20 23:14 perladvent.rdf

The problem is that the code I wrote above downloads the entire file every time it's executed. If it's run every hour it downloads the file each hour, irrespective of how often it changes. What we want to do is compare the modification time of the local file on the hard drive with that of the file on the server before we get the whole content delivered to us.

head

The normal way to see if a file has been updated or not is to use a HEAD request. This request is essentially the same as the normal GET request we make with get, but we ask for only the header information to be sent, without the content. From the command line:

  servalan:~ mark$ lwp-request -m head perladvent.org/perladvent.rdf
  200 OK
  Connection: close
  Date: Sun, 21 Dec 2003 23:03:59 GMT
  Accept-Ranges: bytes
  Server: thttpd/2.23beta1 26may2002
  Content-Length: 1062
  Content-Type: text/plain; charset=iso-8859-1
  Last-Modified: Sat, 20 Dec 2003 23:14:16 GMT
  Client-Date: Sun, 21 Dec 2003 23:04:00 GMT
  Client-Peer: 195.82.114.51:80
  Client-Response-Num: 1

From this we can see that the feed was last updated on Saturday just before midnight, the same time shown in the file listing above. We can also see that thttpd's sending the wrong content-type for an XML file, but that's a whole other issue ;-).

So one thing we can do is make a HEAD request inside our script before we make a GET request, which means we don't have to download the whole RSS feed if it hasn't changed.

  #!/usr/bin/perl
                                                        
  # turn on perl's safety features
  use strict;
  use warnings;
 
  my $url  = "http://perladvent.org/perladvent.rdf";
  my $file = "perladvent.rdf";

  # get the remote modification time
  use LWP::Simple;
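  my ($content_type, $document_length,
       $modified_time, $expires, $server) = head($url)
    or die "Mark probably broke the server again!\n";

  # get the local modification time (zero if there's no local copy yet)
  my $local_modified_time = (stat $file)[9] || 0;

  # check the modification times (a missing Last-Modified header
  # counts as "changed", since we can't tell either way)
  if (!defined $modified_time || $modified_time > $local_modified_time)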
  { 
    # and get the content if we need to
    is_success(getstore($url, $file))
      or die "Mark probably broke the server again!\n";
  }

The Bigger Solution

While this is a good solution, and works quite well if we're mainly concerned with just seeing whether the content has been updated or not, what we really need to do is make a proper GET request, send along the modification time of our local copy of the file, and have the webserver decide if it needs to send us the full data or not.

The normal way to do this with LWP is to craft an HTTP::Request object that contains all the information about your request (what URL you want to get, your modification time, what content types you can accept, etc, etc.), which you then hand to your user agent. LWP::Simple can simplify this process with the mirror function. This function takes the same arguments as getstore, but it examines the existing local file before it sends the request and sends extra information along with it, including the If-Modified-Since header, to allow the server to simply reply, via the status code, that the local copy doesn't need updating.

  #!/usr/bin/perl
                                                        
  # turn on perl's safety features
  use strict;
  use warnings;
 
  my $url  = "http://perladvent.org/perladvent.rdf";
  my $file = "perladvent.rdf";

  # update the content
  use LWP::Simple;
  my $status = mirror($url, $file);

  # check the response
  if (is_error($status))
    { die "Mark probably broke the server again!\n" }
 
  # tell us if the content hasn't changed
  print "Content not changed\n"
   if $status == RC_NOT_MODIFIED;
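
For comparison, the longer route of crafting an HTTP::Request by hand might look roughly like the sketch below. It's a minimal illustration rather than what mirror actually does internally: conditional_request and update_file are names I've made up, and it skips refinements a real implementation would want, such as writing to a temporary file first.

  #!/usr/bin/perl

  # turn on perl's safety features
  use strict;
  use warnings;

  use LWP::UserAgent;
  use HTTP::Request;
  use HTTP::Date qw(time2str);

  # build a GET request that mentions the age of our local copy, if any
  sub conditional_request {
    my ($url, $file) = @_;
    my $request = HTTP::Request->new(GET => $url);
    my $mtime   = (stat $file)[9];
    $request->header("If-Modified-Since" => time2str($mtime))
      if defined $mtime;
    return $request;
  }

  # send the request, and save the body only if the server sent one
  sub update_file {
    my ($url, $file) = @_;
    my $request  = conditional_request($url, $file);
    my $response = LWP::UserAgent->new->request($request);

    # 304 means our copy is still current, so leave it alone
    return "not changed" if $response->code == 304;
    die "Couldn't update $file: ", $response->status_line, "\n"
      unless $response->is_success;

    open my $fh, ">", $file
      or die "Can't open $file for writing: $!";
    binmode $fh;
    print {$fh} $response->content;
    close $fh
      or die "Can't write $file: $!";
    return "updated";
  }

Calling update_file("http://perladvent.org/perladvent.rdf", "perladvent.rdf") would then return either "updated" or "not changed". That's a fair amount of code to replace a single call to mirror, which is rather the point.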

Conclusion

LWP::Simple gives you a lot of bang for your buck, and each of the four functions I've discussed here can download and do something useful with a page (put it in a scalar, put it on disk, just get the header information, get the content only if it's been updated). It lets you write simple code to do what is really a complicated task, and importantly it makes it easy to make sure your code does the right thing and respects other people's servers.