The 2003 Perl Advent Calendar

On the 5th day of Advent my True Language brought to me...
Template::Extract

Template::Extract is a way of extracting data from an unstructured data source without having to write any regular expressions.

As a perversion of the Template Toolkit grammar, it lets you copy and paste chunks of the webpages you want to extract data from into a template file. You then name the bits you want to keep, and mark the bits that change between pages as things to ignore.

While there are better parsing tools for extracting things from HTML, there are very few that are so quick and easy to use, and I've got to admit that this module has become one of my regular tools for simply getting the job done quickly.

The time and date.com Worldclock is a website that shows the current time in cities around the world. This information is actually quite hard to work out on your local computer, especially when one-off changes happen, such as Sydney changing timezone because it is hosting the Olympics. What we'd like to do is extract the times programmatically, so that whenever we want to look up the time in a city we can do it from the command line.

    When we look at the source of the page, we see that the entry for each city looks like this:

     <a href="city.html?n=51">Buenos Aires</a>
     </td><td class=r>Wed 8:57 PM</td>

    This will form the basis of our extraction template. The first thing we need to do is replace each section of the code that we want to capture with a TT variable:

     <a href="city.html?n=51">[% city %]</a>
     </td><td class=r>[% time %]</td>

    And we need to replace all the bits that could change with the special 'it doesn't matter' tag [% ... %].

     <a href="city.html?n=[% ... %]">[% city %]</a>
     </td><td class=r>[% time %]</td>

    We then build this into a simple program that goes and gets the data:

      #!/usr/bin/perl
      
      # turn on perl's safety features
      use strict;
      use warnings;
      
      # create a new template extractor
      use Template::Extract;
      my $extract = Template::Extract->new();

      # get the source of the page
      use LWP::Simple qw(get);
      my $document = get("http://www.timeanddate.com/worldclock/")
        or die "Can't get page";

      # define the template (the same two lines we built above)
      my $template = << '.';
      <a href="city.html?n=[% ... %]">[% city %]</a>
      </td><td class=r>[% time %]</td>
      .

      # extract the data
      my $data = $extract->extract($template, $document);

      # print it out so we can see we've got what we wanted
      use Data::Dumper;
      print Dumper $data;

    When we run it, it prints out the first city on the list:

      $VAR1 = {
                'city' => 'Addis Ababa',
                'time' => 'Thu 3:18 AM'
              };
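
    While developing a template it can help to run it against a saved fragment of the page rather than fetching the live site every time. The following is only a minimal sketch under that assumption: the $fragment string simply reuses the Buenos Aires snippet shown earlier, and extract_first is a hypothetical helper name, not part of Template::Extract.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Template::Extract;

# hypothetical helper: run one template against one chunk of HTML
sub extract_first {
    my ($template, $document) = @_;
    return Template::Extract->new->extract($template, $document);
}

# the same two-line template as above
my $template = << '.';
<a href="city.html?n=[% ... %]">[% city %]</a>
</td><td class=r>[% time %]</td>
.

# a saved fragment standing in for the live page
# (assumption: copied from the snippet shown earlier)
my $fragment = << '.';
<td><a href="city.html?n=51">Buenos Aires</a>
</td><td class=r>Wed 8:57 PM</td>
.

my $data = extract_first($template, $fragment);
print "$data->{city}: $data->{time}\n" if $data;
```

    Because nothing here touches the network, the template can be tweaked and re-run as often as needed before it is pointed at the real page.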

    We now need to modify the code so that it extracts the values for all the cities, not just the first. To do this we wrap the template in a FOREACH directive:

      my $template = << '.';
      [% FOREACH place %]
      [% ... %]
      <a href="city.html?n=[% ... %]">[% city %]</a>
      </td><td class=r>[% time %]</td>
      [% END %]
      .

    Now when we run this, it prints out a deeper data structure:

      $VAR1 = {
                'place' => [
                             {
                               'city' => 'Addis Ababa',
                               'time' => 'Thu 3:23 AM'
                             },
                             {
                               'city' => 'Hanoi',
                               'time' => 'Thu 7:23 AM'
                             },
                             {
                               'city' => 'New York',
                               'time' => 'Wed 7:23 PM'
                             },

    And so on. So we've managed to extract the data we needed without having to think about using a single regular expression. But hang on, what's this? One of our entries doesn't look right:

      {
         'city' => 'Adelaide</a> *</td><td class=r>Thu 10:53 AM<
      /td><td><a href="city.html?n=96">Harare',
         'time' => 'Thu 2:23 AM'
      },

    Some of the entries on that page have an asterisk next to them to indicate that they're using daylight saving time. Because the asterisk sits between the closing </a> and the </td>, our literal text no longer matches there, and [% city %] carries on capturing until the next place where it does. This is an inherent problem with using an unstructured parsing system: it's just not that flexible, and it doesn't understand that we wanted whatever was inside the A tags, as a proper HTML/XML parser would. That said, it doesn't take long to fix:

      my $template = << '.';
      [% FOREACH place %]
      [% ... %]
      <a href="city.html?n=[% ... %]">[% city %]</a>
      [% ... %]</td><td class=r>[% time %]</td>
      [% END %]
      .
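
    Problems like the Adelaide entry are easy to miss in a long Data::Dumper listing. One way to catch them is to scan the extracted structure for captured values that still contain markup. This is only a sketch: suspicious_entries is a hypothetical helper, the sample data is hard-coded to mimic the structure shown above, and treating any '<' or '>' as a sign of trouble is an assumption that suits this page rather than a general rule.

```perl
use strict;
use warnings;

# hypothetical helper: return the entries whose captured fields
# still contain an angle bracket, i.e. leftover HTML
sub suspicious_entries {
    my ($data) = @_;
    my @bad;
    foreach my $entry (@{ $data->{place} || [] }) {
        # index() returns -1 when the substring is absent
        my $has_markup = grep {
            index($_, '<') >= 0 || index($_, '>') >= 0
        } values %$entry;
        push @bad, $entry if $has_markup;
    }
    return @bad;
}

# hard-coded sample in the same shape as the extractor's output
my $sample = {
    place => [
        { city => 'Addis Ababa', time => 'Thu 3:23 AM' },
        { city => 'Adelaide</a> *</td><td class=r>Thu 10:53 AM</td>'
                . '<td><a href="city.html?n=96">Harare',
          time => 'Thu 2:23 AM' },
    ],
};

# report anything that looks wrong on STDERR
foreach my $entry (suspicious_entries($sample)) {
    warn "Suspicious entry: $entry->{city}\n";
}
```

    Running a check like this after each change to the template gives a quick signal that every entry was captured cleanly.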

    All that's left to do now is print out the time for whichever city matches:

      # work out what city people were looking for
      my $city = join ' ', map { lc($_) } @ARGV;
      # check each of the cities in turn
      foreach my $entry (@{ $data->{place} })
      {
        # is this the city we were looking for?
        if (lc($entry->{city}) eq $city)
        {
           print "$entry->{time}\n";
           exit;
        }
      }
     
      # if we got this far then we didn't find the city
      print "No city called '$city' found\n";
      exit 1;
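
    If the lookup is going to be reused, or tested without fetching the page, the loop above can be wrapped in a function that takes the extracted data as an argument. A minimal sketch, assuming the data has the shape shown earlier; find_time is a hypothetical name and the sample data is hard-coded for illustration.

```perl
use strict;
use warnings;

# hypothetical helper: case-insensitive lookup of a city's time;
# returns undef when the city isn't in the list
sub find_time {
    my ($data, @words) = @_;
    my $wanted = join ' ', map { lc } @words;
    foreach my $entry (@{ $data->{place} || [] }) {
        return $entry->{time} if lc($entry->{city}) eq $wanted;
    }
    return;
}

# hard-coded sample in the same shape as the extractor's output
my $sample = {
    place => [
        { city => 'Addis Ababa', time => 'Thu 3:23 AM' },
        { city => 'Hanoi',       time => 'Thu 7:23 AM' },
        { city => 'New York',    time => 'Wed 7:23 PM' },
    ],
};

# the words would normally come from @ARGV
my $time = find_time($sample, qw(new york));
print defined $time ? "$time\n" : "No such city\n";
```

    Returning undef instead of calling exit leaves it to the caller to decide how to report a missing city.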

    Copyright

    I draw your attention to the copyright notice for Time and Date.com.

  • Painless RSS with Template::Extract, excerpt from Spidering Hacks published by O'Reilly
  • Time and Date.com
  • Using XML::LibXML to extract data
  • Perl & LWP, published by O'Reilly
  • Data Munging With Perl, published by Manning