Perl Advent Calendar 2006-12-19

Feast on this

by Shlomi Fish

XML::Feed provides a unified, object-oriented API for manipulating feeds in both the newer Atom format and the RSS variants. Besides converting between Atom and the various RSS versions, this module can also fetch feeds directly, merge multiple feeds into a single document, and add or remove individual entries. We will take advantage of many of these features in this article to build a simple command-line feed aggregator that combines several remote feeds into one, with some rudimentary filtering capabilities.

One can instantiate an XML::Feed object in one of two ways, either by creating an empty document of the specified format:

    XML::Feed->new("RSS")

or by giving the URI of a document to fetch and parse:

    XML::Feed->parse(URI->new($uri))

XML::Feed also supports combining feeds with its splice() method, but be aware that it will not automagically splice Atom feeds with RSS ones. To get around this we use the myconvert() subroutine on line 76 of the example, which converts each input feed to the same format as the output.
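
The decision of when a feed has to be converted can be looked at in isolation. The following is only a sketch: needs_conversion() is a hypothetical helper, and it assumes that format() reports strings such as "Atom", "RSS 1.0" or "RSS 2.0", which is why the RSS side is tested with "not Atom" rather than with an exact match.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a feed must be converted before it
# can be spliced into an output feed of the given family.
# Assumes format strings such as "Atom", "RSS 1.0" or "RSS 2.0".
sub needs_conversion {
    my ($output_format, $feed_format) = @_;
    my $output_is_atom = $output_format eq "Atom";
    my $feed_is_atom   = $feed_format   eq "Atom";

    # Convert only when exactly one of the two sides is Atom.
    return ($output_is_atom xor $feed_is_atom) ? 1 : 0;
}

for my $pair (["RSS", "Atom"], ["RSS", "RSS 2.0"],
              ["Atom", "RSS 0.91"], ["Atom", "Atom"]) {
    my ($out, $in) = @$pair;
    printf "%-4s <- %-8s : %s\n", $out, $in,
        needs_conversion($out, $in) ? "convert" : "splice as is";
}
```

Written this way, feeds in different RSS versions are treated as compatible with an RSS output and are left untouched.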

A list of a feed's entries can be retrieved with the $feed->entries() method. Each entry is an object with a consistent API, regardless of format, for getting and setting its properties. Among the accessors that the entry object supports are title() and issued(), both of which are used in the example below.

Each of these accessors can be used to retrieve the value of the property, or to set it by passing a value. An entry can be added to a feed using the feed's add_entry() method. This can be used to create feeds that cull entries from other documents. The feed object itself supports some accessors for the global properties of any feed format, such as link(), author(), language(), and copyright().
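
To see the getter and setter behaviour without touching the network, one can parse a document held in a string. This is a minimal sketch: the RSS document is made-up sample data, and it assumes that parse() accepts a reference to a scalar containing the XML in addition to a URI object.

```perl
use strict;
use warnings;
use XML::Feed;

# Hypothetical sample document; any RSS or Atom feed would do.
my $xml = <<'END_XML';
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>Sample data for illustration</description>
    <item>
      <title>Perl news</title>
      <link>http://example.com/perl-news</link>
      <guid>http://example.com/perl-news</guid>
      <description>First sample entry</description>
    </item>
    <item>
      <title>Other news</title>
      <link>http://example.com/other-news</link>
      <guid>http://example.com/other-news</guid>
      <description>Second sample entry</description>
    </item>
  </channel>
</rss>
END_XML

# Parse from a string reference instead of fetching a URI.
my $feed = XML::Feed->parse(\$xml) or die XML::Feed->errstr;

# Called without arguments, an accessor returns the current value.
my @titles = map { $_->title() } $feed->entries();

# Called with an argument, the same accessor sets the value.
my ($first) = $feed->entries();
$first->title("Renamed entry");
my $renamed = $first->title();

print "Format: ", $feed->format(), "\n";
print "Link:   ", $feed->link(), "\n";
print "Titles: @titles\n";
print "Now:    $renamed\n";
```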

The Desired Feature Set

The first thing we want is the ability to specify several URLs for feeds to retrieve and combine. We'll use the --url option (-u for short) for this; it can be repeated, once per feed. Next we want to be able to specify an output file. For this we'll use the -o flag. If no output file is specified, the program will output to STDOUT.
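
A repeatable option like this can be tried out on its own with Getopt::Long. The sketch below parses a made-up argument list with GetOptionsFromArray() rather than the real command line; the URLs and the file name are placeholders.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical command line, used only for this illustration.
my @args = (
    '--url', 'http://example.com/a.rss',
    '-u',    'http://example.com/b.atom',
    '-o',    'combined.xml',
);

my @feed_urls;
my $output_file;

GetOptionsFromArray(
    \@args,
    'url|u=s@' => \@feed_urls,    # may be repeated; the values accumulate
    'o=s'      => \$output_file,  # optional; undef means STDOUT
) or die "Could not parse the options\n";

print "Feeds:  @feed_urls\n";
print "Output: ", (defined $output_file ? $output_file : "STDOUT"), "\n";
```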

Next we'd like to specify the output format for the file, a choice between Atom and RSS, with the --output-format option; the default is RSS. We'll also want to cap the number of entries in the output with --num-entries, which defaults to 40. For the sake of demonstration we'll provide some filtering capabilities: --subject-filter specifies an optional positive regex that entry subjects must match, and --subject-filter-out specifies an optional negative regex that entry subjects must not match.
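
How the two filters interact can be illustrated with plain strings standing in for entry subjects. In this sketch filter_titles() is a hypothetical helper and the titles are made-up data; an undefined filter simply lets everything through.

```perl
use strict;
use warnings;

# Hypothetical entry titles standing in for $entry->title() values.
my @titles = ("Perl 5.8.8 released", "Python tips",
              "Perl and spam", "Parrot news");

# Keep a title only if it matches the positive filter (when given)
# and does not match the negative filter (when given).
sub filter_titles {
    my ($keep, $drop, @candidates) = @_;
    $keep = qr/$keep/ if defined $keep;
    $drop = qr/$drop/ if defined $drop;
    return grep {
        (!defined $keep || $_ =~ $keep) &&
        (!defined $drop || $_ !~ $drop)
    } @candidates;
}

my @kept = filter_titles("Perl|Parrot", "spam", @titles);
print "$_\n" for @kept;
```

Here "Python tips" fails the positive filter and "Perl and spam" is rejected by the negative one, so only the other two titles survive.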

Finally, the --feed-link option specifies the link for the entire feed, in order to customize it a bit. The program refuses to run without it.

mod19.pl

   1 #!/usr/bin/perl
   2 
   3 use strict;
   4 use warnings;
   5 
   6 use Getopt::Long;
   7 use List::Util qw(min);
   8 use XML::Feed;
   9 
  10 my @feed_urls;
  11 my $num_entries = 40;
  12 my($output_format, $output_file) = ("RSS", undef); # No file means STDOUT
  13 my($subj_filter, $subj_filter_out, $feed_link);
  14 
  15 GetOptions(
  16            'url|u=s@' => \@feed_urls,                   # Sources
  17            'o=s' => \$output_file,                      # Output file
  18            'output-format=s' => \$output_format,        # Output type
  19            'num-entries=i' => \$num_entries,            # Entry limit
  20            'subject-filter=s' => \$subj_filter,         # Positive filter
  21            'subject-filter-out=s' => \$subj_filter_out, # Negative filter
  22            'feed-link=s' => \$feed_link,                # Link location
  23           ) or die "Invalid command line options\n";
  24 
  25 
  26 my $feed                   = XML::Feed->new($output_format) or
  27   die XML::Feed->errstr;
  28 my $feed_with_less_entries = XML::Feed->new($output_format) or
  29   die XML::Feed->errstr;
  30 if (!defined($feed_link)) {
  31   die "The feed's link was not specified!";
  32 }
  33 else {
  34   $feed_with_less_entries->link($feed_link);
  35 }
  36 
  37 
  38 # Precompile the filters; alternation allows several patterns, e.g. foo|bar
  39 foreach my $f ($subj_filter, $subj_filter_out) {
  40   if (defined($f)) {
  41     $f = qr/$f/;
  42   }
  43 }
  44 
  45 foreach my $url (@feed_urls) {
  46   my $url_feed = XML::Feed->parse(URI->new($url))
  47     or die XML::Feed->errstr;
  48   $feed->splice(myconvert($url_feed));
  49 }
  50 
  51 my @entries = grep
  52   {
  53     (defined($subj_filter)     ? ($_->title() =~ /$subj_filter/)     : 1) &&
  54     (defined($subj_filter_out) ? ($_->title() !~ /$subj_filter_out/) : 1)
  55   }
  56   $feed->entries();
  57 @entries = reverse(sort { $a->issued() <=> $b->issued() } @entries);
  58 
  59 foreach my $e (@entries[0 .. min($num_entries-1, $#entries)]) {
  60   $feed_with_less_entries->add_entry($e);
  61 }
  62 
  63 
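  64 my $out;
  65 if ($output_file) {
  66   open $out, ">", $output_file or die "Cannot open $output_file: $!";
  67 }
  68 else {
  69   open $out, ">&STDOUT" or die "Cannot dup STDOUT: $!";
  70 }
  71 binmode $out, ":utf8";
  72 print {$out} $feed_with_less_entries->as_xml();
  73 close($out) or die "Cannot close the output: $!";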
  74 
  75 
  76 sub myconvert{
  77   my $feed = shift;
  78   if (
  79       (($output_format eq "RSS") && ($feed->format() eq "Atom")) ||
  80       (($output_format eq "Atom") && ($feed->format() ne "Atom"))
  81      )
  82   {
  83     return $feed->convert($output_format);
  84   }
  85   else {
  86     return $feed;
  87   }
  88 }
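
The sorting and truncating done on lines 57 to 59 can be tried out on plain data. The following sketch uses made-up hashes with epoch timestamps in place of entry objects and their issued() dates, and newest() is a hypothetical helper name.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical entries: plain hashes with epoch timestamps standing in
# for entry objects and their issued() dates.
my @entries = (
    { title => "oldest", issued => 1_166_400_000 },
    { title => "newest", issued => 1_166_572_800 },
    { title => "middle", issued => 1_166_486_400 },
);

sub newest {
    my ($limit, @items) = @_;

    # Swapping $a and $b sorts in descending order, so no reverse() is needed.
    my @sorted = sort { $b->{issued} <=> $a->{issued} } @items;

    # min() keeps the slice inside the list when it is shorter than $limit.
    return @sorted[0 .. min($limit - 1, $#sorted)];
}

print join(", ", map { $_->{title} } newest(2, @entries)), "\n";
```

Because the upper bound of the slice is clamped with min(), asking for more entries than exist, or passing an empty list, yields a shorter or empty result instead of undefined elements.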

SEE ALSO

Plagger is an RSS/Atom manipulation framework built on top of XML::Feed and other modules. Its plug-ins cover many common tasks and can be combined to accomplish all sorts of interesting things.