Template::Extract is a way of extracting data from an unstructured data source without having to write any regular expressions.
As a perversion of the Template Toolkit grammar, it lets you copy and paste the chunks of a webpage you want to extract data from into a template file, name the bits you want to keep, and throw away the bits that change between pages.
While there are better parsing tools for extracting things from HTML, very few are so quick and easy to use, and I've got to admit that this module has become one of my regular tools for simply getting the job done quickly.
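The examples here work on the world clock page at http://www.timeanddate.com/worldclock/, which lists cities alongside their current local times.
When we look at the source of the page, we see that each of the cities looks like this: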
<a href="city.html?n=51">Buenos Aires</a> </td><td class=r>Wed 8:57 PM</td>
This will form the basis of our extraction template. What we need to do first is replace each section of the code that we want to capture with a TT variable:
<a href="city.html?n=51">[% city %]</a> </td><td class=r>[% time %]</td>
And we need to replace all the bits that could change with the special 'it doesn't matter' tag, [% ... %]:
<a href="city.html?n=[% ... %]">[% city %]</a> </td><td class=r>[% time %]</td>
We then build this into a simple program that goes and gets the data:
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

# create a new template extractor
use Template::Extract;
my $extract = Template::Extract->new();

# get the source of the page
use LWP::Simple qw(get);
my $document = get "http://www.timeanddate.com/worldclock/"
  or die "Can't get page";

# define the template
my $template = << '.';
<a href="city.html?n=[% ... %]">[% city %]</a> </td><td class=r>[% time %]</td>
.

# extract the data
my $data = $extract->extract($template, $document);

# print it out so we can see we've got what we want
use Data::Dumper;
print Dumper $data;
And when we run it, it prints out the first city on the list:

$VAR1 = {
          'city' => 'Addis Ababa',
          'time' => 'Thu 3:18 AM'
        };
We now need to modify the code so that it extracts the values for every city on the page, not just the first. To do that we use the FOREACH directive:
my $template = << '.';
[% FOREACH place %] [% ... %] <a href="city.html?n=[% ... %]">[% city %]</a> </td><td class=r>[% time %]</td> [% END %]
.

Now when we run this, it prints out a deeper data structure:

$VAR1 = {
          'place' => [
                       {
                         'city' => 'Addis Ababa',
                         'time' => 'Thu 3:23 AM'
                       },
                       {
                         'city' => 'Hanoi',
                         'time' => 'Thu 7:23 AM'
                       },
                       {
                         'city' => 'New York',
                         'time' => 'Wed 7:23 PM'
                       },
And so on. So we've managed to extract the data we needed without having to think about using a single regular expression. But hang on, what's this? One of our entries doesn't look right:
{
  'city' => 'Adelaide</a> *</td><td class=r>Thu 10:53 AM</td><td><a href="city.html?n=96">Harare',
  'time' => 'Thu 2:23 AM'
},

Some of the entries on that page have an asterisk next to them to indicate that they're using daylight saving time. This is an inherent problem with using an unstructured parsing system - it's just not that flexible, and it doesn't understand that we wanted whatever was inside the A tags, as a proper HTML/XML parser would. That said, it doesn't take long to fix:
my $template = << '.';
[% FOREACH place %] [% ... %] <a href="city.html?n=[% ... %]">[% city %]</a> [% ... %]</td><td class=r>[% time %]</td> [% END %]
.
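A mismatch like the Adelaide entry is easy to miss in a long dump. As a minimal sketch (not part of the original script), the check below flags any captured value that still contains a '<', which suggests the template swallowed some markup. It runs on hard-coded sample data in the same shape as the structure shown above; the function name suspicious_entries is hypothetical, and index() is used so that we still don't need a regular expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample in the shape Template::Extract returned above;
# the second entry is the mis-captured one from the article.
my $data = {
    place => [
        { city => 'Addis Ababa', time => 'Thu 3:23 AM' },
        {
            city => 'Adelaide</a> *</td><td class=r>Thu 10:53 AM</td>'
                  . '<td><a href="city.html?n=96">Harare',
            time => 'Thu 2:23 AM',
        },
        { city => 'New York', time => 'Wed 7:23 PM' },
    ],
};

# Return the entries whose captured fields still contain markup.
# index() looks for a literal '<' and returns -1 when there is none.
sub suspicious_entries {
    my ($data) = @_;
    return grep {
        index($_->{city}, '<') >= 0 || index($_->{time}, '<') >= 0
    } @{ $data->{place} };
}

my @bad = suspicious_entries($data);
warn "Suspicious capture: $_->{city}\n" for @bad;
print scalar(@bad), " of ", scalar(@{ $data->{place} }),
    " entries look wrong\n";
```

Running a check like this after each change to the template gives a quick signal that every entry was split where we intended.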
All that's left to do now is print out the time for whichever city matched:

# work out what city people were looking for
my $city = join ' ', map { lc($_) } @ARGV;

# check each of the cities in turn
foreach my $entry (@{ $data->{place} }) {
  # is this the city we were looking for?
  if (lc($entry->{city}) eq $city) {
    print "$entry->{time}\n";
    exit;
  }
}

# if we got this far then we didn't find the city
print "No city called '$city' found\n";
exit 1;
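If more than one lookup is needed per run, the linear scan can be swapped for a hash keyed on the lowercased city name. This is only a sketch of that alternative, using hard-coded sample data copied from the dump earlier; %time_for and time_in are hypothetical names, not anything Template::Extract provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data copied from the Dumper output shown earlier.
my $data = {
    place => [
        { city => 'Addis Ababa', time => 'Thu 3:23 AM' },
        { city => 'Hanoi',       time => 'Thu 7:23 AM' },
        { city => 'New York',    time => 'Wed 7:23 PM' },
    ],
};

# Build a lookup table keyed on the lowercased city name.
# If two entries share a name, the later one wins.
my %time_for = map { ( lc($_->{city}) => $_->{time} ) } @{ $data->{place} };

# Join the words of a city name (as they would arrive in @ARGV),
# lowercase them, and return the matching time or undef.
sub time_in {
    my (@words) = @_;
    my $city = lc( join ' ', @words );
    return $time_for{$city};
}

my $time = time_in('New', 'York');
print defined $time ? "$time\n" : "No such city\n";
```

For a single lookup the original loop is just as good; the hash only pays off when the same extracted data is queried repeatedly.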
I draw your attention to the