The xml_grep utility and XML-Twig 101 are good intros to the XML::Twig suite, but that does not make for a very interesting Calendar entry.
As you may have noticed, this year's Advent Calendar layout does not obscure the picture with the calendar squares (window shutters), but it also does not have a full 25 windows. Since we have missed some days the last few years, this should not be a problem, right? Well, the X-Y coords and dates are hard-coded, so skipping a day requires some fussy work. This clearly calls for a Perl script that parses the HTML file and increments the day of every unused entry (and only those), to be run whenever we take a skip day. In case the authors and editors are very naughty or lazy and skip more than 3 days -- and thus will be getting coal in their stockings -- the script can even remove the last box(es), deleting any day that would increment beyond the 25th.
perl -i.bak mod6.pl index.html
mv index.html.bak index.before.html
diff -u index.before.html index.html | tee result-diff.txt
...
-<br><div class="q"><a href="4/" style="left:425px; top: 5px">4</a></div>
-<br><div class="q"><a href="" style="left:545px; top: 5px">5</a></div>
-<br><div class="q"><a href="" style="left:665px; top: 5px">6</a></div>
...
-<br><div class="C"><a href="" style="left: 5px; top: 52px">22</a></div>
...
+<br /><div class="q"><a href="4/" style="left:425px; top: 5px">4</a></div>
+<br /><div class="q"><a href="" style="left:545px; top: 5px">6</a></div>
+<br /><div class="q"><a href="" style="left:665px; top: 5px">7</a></div>
...
+<br /><div class="C"><a href="" style="left: 5px; top: 52px">23</a></div>
The result is fairly subtle, and the diff above is somewhat opaque. If you reached this page by opening the calendar door -- if not, you are missing half the traditional seasonal fun -- you may have noticed that 5 was missing. Otherwise, you may need to compare closely against the previous page state to see that the unused boxes from 5 through 22 were each incremented.
(Yes Virginia, there will be a Christmas door for the 25th.)
Since it treats the file as XML, Twig does not normally preserve the original whitespace, but the keep_spaces => 1 option assists us here.
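To see what keep_spaces changes, here is a minimal sketch that parses the same small fragment twice, once with the option and once with the XML::Twig defaults, and prints both serializations. The sample markup and the variable names are made up for illustration; they are not taken from the calendar page.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: whitespace (with newlines) between the elements.
my $xml = "<div>\n  <a>5</a>\n  <a>6</a>\n</div>";

# With keep_spaces, whitespace-only text between elements is kept.
my $kept = XML::Twig->new( keep_spaces => 1 );
$kept->parse($xml);
my $kept_out = $kept->root->sprint;

# With the defaults, whitespace that looks non-significant is discarded.
my $default = XML::Twig->new;
$default->parse($xml);
my $default_out = $default->root->sprint;

print "kept:    [$kept_out]\n";
print "default: [$default_out]\n";
```

Comparing the two printed strings shows whether the line breaks and indentation between the elements survived the round trip.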
XML-Twig is allergic to HTML that is NOT well-formed XML, so until we upgrade to XHTML, the script needs to insert the closing slash into empty tags as needed and swap out the HTML-only entities (see note 1).
use strict;
use warnings;
use 5.010;    # \K in the repair regex needs Perl 5.10+
use XML::Twig;

my $t = XML::Twig->new(
    pretty_print => 'indented',    # output nicely formatted
    keep_spaces  => 1,             # keep the original whitespace and layout
    empty_tags   => 'html',        # outputs <empty_tag />
);

my $contents = do { local $/; <> };    # slurp the whole file

# repair HTML into well-formed XML
$contents =~ s[< $_ (?: \b [^>]*? [^/])? \K >][ />]gxism for qw[ link br img];
$contents =~ s[&middot;][\xb7]gxism;    # swap the HTML-only entity for the raw character

eval { $t->parse($contents) }
    or die "$@ \n contents = $contents\n";

my $root = $t->root;
my @para = $root->get_xpath('.//a[@href=""]');    # unused days: <a> with an empty href
foreach my $para (@para) {
    $para->set_text( $para->text() + 1 );
    $para->delete() if $para->text() > 25;
}

# output the document
$contents = $t->sprint($root);
$contents =~ s[\xb7][&middot;]gxism;    # restore HTML entities
print $contents;
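The empty-tag repair can be tried on its own, without XML::Twig. The sketch below wraps the same substitution in a helper named close_empty_tags, which is a hypothetical name introduced here for illustration, and runs it over a made-up sample string.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper: add " /" before the ">" of link, br and img tags
# that are not already self-closed.
sub close_empty_tags {
    my ($html) = @_;
    $html =~ s[< $_ (?: \b [^>]*? [^/])? \K >][ />]gxism for qw[link br img];
    return $html;
}

# Made-up sample: one bare <br>, one <img> with an attribute,
# one <br /> that is already closed, and a tag that must be left alone.
my $before = qq{<br><img src="a.png"><br /><b>bold</b>};
my $after  = close_empty_tags($before);

say $before;
say $after;
```

The \K keeps everything matched so far out of the replacement, so only the final ">" is rewritten, and the [^/] guard is what leaves an already closed tag such as <br /> untouched.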
1. This does mean that <p> tags need close tags </p>, and tags must be properly nested, not straddled like <b><i> blah </b></i>.