YA Perl Advent Calendar 2005-12-16

Somebody asked for an RSS feed because they love the Advent calendar so much, yet being a Perl hacker they also exhibited the virtue of Laziness and didn't feel like checking every 20 minutes to see if I'd released the day's writeup yet. Besides, Mark used to have one. I, on the other hand, did not feel it was worthwhile for such an ephemeral urgency, particularly since there was no question as to whether there would be an update, only when.

On the other hand our previous guest author, William 'N1VUX' Ricker, was interested in learning a little about RSS and felt it a worthy endeavor. We therefore proudly present you with an RSS feed. Enjoy!

P.S. If you're at a loss for uses/relevance of RSS you might try something like this, creating a feed for your favorite comics.

Adding RSS to the Calendar isn't as bad as I thought: there's a "Simple" module, from Sean Burke of course, XML::RSS::SimpleGen.

After the usual painless install and a little cargo-cult hacking of the POD example we're already half done ..

$ cat yapac.rss # first version
<?xml version="1.0"?>
<?xml-stylesheet title="CSS_formatting" type="text/css" 
    href="http://www.interglacial.com/rss/rss.css"?>
<rss version="2.0"  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
<channel>
<!-- Generated with Perl's XML::RSS::SimpleGen v11.11 -->
<link>http://web.mit.edu/belg4mit/www/</link>
<title>YAPAC</title>
<description>Yet Another Perl (Advent) Calendar</description>
<language>en</language>
<lastBuildDate>Thu, 15 Dec 2005 02:31:27 GMT</lastBuildDate>
<skipHours><hour>0</hour><hour>1</hour><hour>3</hour>...<hour>23</hour></skipHours>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updatePeriod>daily</sy:updatePeriod>
<sy:updateBase>1970-01-01T02:40+00:00</sy:updateBase>
<ttl>1440</ttl>
<webMaster>jpierce@cpan.org</webMaster>
<docs>http://www.interglacial.com/rss/about.html</docs>

<item>  
	<title>1..5</title> 
	<link>http://web.mit.edu/belg4mit/www/5/</link>
	<description>On day 5/, my true language gave to me</description>
</item>

<item>  
	<title>6</title>  
	<link>http://web.mit.edu/belg4mit/www/6/</link>  
	<description>On day 6/, my true language gave to me</description>
</item>
etc...
</channel></rss>

Since the main page doesn't have descriptions or titles, there's a slight problem with using the POD's main-page scrape to generate descriptions. So we either have to fill in a default as above, or fetch each page (or the header of the page) to grab its <title> and use that as the <description>. However, that was already discussed on Catchup Day 1..5, so it shouldn't be too hard.

So what's left to do after the quick hack?

TODO after first 20 minutes ...

  1. Scrape day pages to get titles to replace main-page scrape of $3 for description.
  2. What about this rss_history_file thing?
  3. Hook the (MIT) favicon as RSS image?
  4. Figure out how to make it appear in Firefox for subscribing
  5. Test
  6. Validate RSS format
  7. Lather, Rinse, Repeat

So with the second round of TUITs, we work the TODO list ...

1. Scrape day pages to get titles

We need to replace the main-page scrape of $3 in the regular expression for the description with a fetch of each page's <title>. (We could put in the name of the module, but that would be telling.) The first attempt at this used LWP::Simple and HTML::HeadParser while the main-page parse was still in progress ... which cancelled the main HTML parse. Bad. Apparently the different parsers share something under the hood. So we switch to an agenda: record the links first, and fetch the titles only after the main parse has finished.
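
The agenda idea can be sketched in isolation with core Perl only. The sample HTML string and the stand-in fake_title function are assumptions for illustration (the real script fetches each day page over HTTP); the regular expression is the one used in the final script.

```perl
use strict;
use warnings;

# Sample of the calendar's main-page markup (an assumption for illustration).
my $html = <<'HTML';
<br><div><a href="5/" style="left: 10px; top: 20px;">1..5</a></div>
<br><div><a href="6/" style="left: 60px; top: 20px;">6</a></div>
HTML

# Hypothetical stand-in for the per-page fetch; the real script uses LWP.
sub fake_title { my ($page) = @_; return "Title of page $page" }

# Pass 1: finish the whole scan, only recording what was found.
my @agenda;
while ( $html =~ m{<div> \s* <a \s href="(\d+/)" [^>]* > ([^<>]*) </a> \s* </div>}xisg ) {
    push @agenda, { page => $1, link => $2 };
}

# Pass 2: the scan is over, so slow or side-effecting work can happen now.
for my $entry (@agenda) {
    $entry->{title} = fake_title( $entry->{page} );
    print "$entry->{page} $entry->{link} '$entry->{title}'\n";
}
```

Keeping the two passes separate means nothing that runs during the second pass can disturb the position of the first pass's match.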

Let's add a little trace output so we know what it's doing ...

$ perl -I XML-RSS-SimpleGen-11.11: modXRS.pl
5/ 1..5 'YA Perl Advent Calendar 2005: Catchup'
6/ 6 'YA Perl Advent Calendar 2005: On the ordinate(6) day of X-Mas'
7/ 7 'YA Perl Advent Calendar 2005-12-07'
8/ 8 'YA Perl Advent Calendar 2005: On the 8E00000000 day of Advent my True Language brought to me...'
9/ 9 'YA Perl Advent Calendar 2005: Buzzword Bingo'
10/ 10 'YA Perl Advent Calendar 2005: Tarball Toolbelt'
11/ 11 'YA Perl Advent Calendar 2005: Conjunction Junction'
12/ 12 'YA Perl Advent Calendar 2005: re-run'
13/ 13 'YA Perl Advent Calendar 2005: A penny saved is a penny earned'
14/ 14 'YA Perl Advent Calendar 2005: Keeping it clean'
15/ 15 'YA Perl Advent Calendar 2005: SCALAR(0xdeadbeef)'

Note: It won't recreate a file if the results don't differ, so for testing I'm rm-oving the file each time so I can see if it actually creates something.

2. What about this rss_history_file thing?

SKIP: Well, no, we don't want this, it would only show today's link... we'll never have more than 25 items anyway.

3. Hook the (MIT) favicon as RSS image

SKIP: This seems easy: to make the MIT favicon the icon of the RSS feed we add an rss_image line. But a favicon isn't really intended for this; an RSS image probably needs to be larger for use in the tab of a feed reader.

4. Make it appear in Firefox

In <head> section, put
<link rel="alternate" type="application/rss+xml" title="RSS" href="../yapac-rss.xml">

5. Test.

It works! The same link-ing convention, but with ./, would work on the main page.
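
As a small illustration of that convention, the relative path to the feed depends only on how far the page sits below the calendar root. The helper name feed_link_tag and its depth argument are hypothetical, not part of the calendar's actual tooling.

```perl
use strict;
use warnings;

# Hypothetical helper: build the autodiscovery tag for a page that sits
# $depth directories below the calendar root (0 = main page, 1 = day page).
sub feed_link_tag {
    my ($depth) = @_;
    my $prefix = $depth ? '../' x $depth : './';
    return qq[<link rel="alternate" type="application/rss+xml" ]
         . qq[title="RSS" href="${prefix}yapac-rss.xml">];
}

print feed_link_tag(0), "\n";   # main page
print feed_link_tag(1), "\n";   # a day page such as 16/
```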

6. Validate RSS format

The web-app validator that I tested this with, http://feedvalidator.org, was recommended in the module POD. The webserver that I parked the RSS file on for testing served a *.rss file as text/plain, which gets a warning from feedvalidator, so I changed the generated file's name to *-rss.xml.

7. Lather, Rinse, Repeat

To automate the update process you probably want to add this script to cron, or otherwise integrate it into your publication procedure.

mod16.pl


#!/usr/bin/env perl
use warnings;
use strict;
use Carp;

sub utility::get_title;   # forward declaration; defined in package utility below

# A complete screen-scraper and RSS generator 
# adapted from XML::RSS::SimpleGen POD

use XML::RSS::SimpleGen;
my $url = q<http://web.mit.edu/belg4mit/www/>;

rss_new( $url, "YAPAC", "Yet Another Perl (Advent) Calendar" );
rss_language( 'en' );
rss_webmaster( 'jpierce@cpan.org' );
# image is not supposed to be a favicon, but a GIF, skip for now.
# rss_image("http://yourpath.com/icon.gif",32,32);
rss_daily();

get_url( $url );   # fetches the main page into $_
my @pages;  # List of things to process

while(
      # was
      #  m{<h4>\s*<a href='/(.*?)'.*?>(.*?)</a>\s*</h4>\s*<p.*?>(.*?)<a href='/}sg
      # now must match
      # <br><div><a href="10/" style="left: 375px; top: 255px;">10</a></div>
      
      m{<div> \s* <a \s href="(\d+/)" [^>]* > ([^<>]*) </a> \s* </div> }xisg
      
     ) {
  
  my ($page, $linkText, $title)=($1,$2, undef); #$3 is empty
  
  # Defer with agenda
  push @pages, {page=>$page, link=>$linkText, title=>$title};
}

# now work the agenda, once we've finished the previous parse.
for my $pageRef (@pages) {
  my ($page, $link, $title)=(@$pageRef{qw{page link title}});
  $title ||= utility::get_title($page) || "Advent Calendar Page - No Title";
  print "$page $link '$title' \n";   # trace output
  rss_item("$url$page", $link, $title ) ;
}


croak "No items in this content?! {{\n$_\n}}\nAborting"
  unless rss_item_count();
  
rss_save( 'yapac-rss.xml', 45 );
print "success\n";

exit;

### Reuse HTML::HeadParser from day 5
package utility;
use LWP::Simple;
use HTML::HeadParser;
use Carp;

# not safe for mod_perl ...

sub get_title {
  my $header = HTML::HeadParser->new();
  my $date = shift || croak "get_title requires arg of page name";
  
  # Use a lexical here: the main program keeps the scraped page in the global $_.
  my $page_url = "http://web.mit.edu/belg4mit/www/$date";
  my $content = get( $page_url );

  unless( $content ) {
    warn("No content for: $page_url\n");
    return;
  }
  
  $header->parse($content);
  return $header->header('Title');
}