Perl 2002 Advent Calendar: XML::LibXML

On the 22nd day of Advent my True Language brought to me..

XML::LibXML

XML::LibXML is an interface to the GNOME libxml2 XML parsing library, which is very very fast indeed. It provides a large number of features and it provides them well. It's fully buzzword compliant, mentioning words like SAX, DOM, and XPath. Best of all, it allows you to mix these technologies freely, meaning that you can very quickly get at the right tool for the job.

Oh, and if that wasn't enough it's really really fast. What more could you ask for from an XML parser?

[Read the documentation for XML::LibXML on search.cpan.org]

I've already written about using XML::LibXML to parse HTML in the entry on WWW::Mechanize. That was an example of using the XPath notation to find parts of the HTML.

XPath is commonly called "to XML what regular expressions are to strings." What this rather befuddling statement means is that XPath expressions, like regular expressions, are not instructions for how to find the data we're after, but a statement of the goal: a specification of the thing we're trying to find. For example:

   # create a new parser
   use XML::LibXML;
   my $parser = XML::LibXML->new();

   # parse the document
   my $doc = $parser->parse_file("myxml.xml");

   # get the <img> tags
   my @img = $doc->findnodes("//img");

Notice that the expression doesn't say how one should go about finding the nodes in the document. The "//" simply means 'matching anywhere in the document' and "img" is the name of the tag; it's just a spec. XPath lets you write quite complicated specifications that pretty much allow you to select exactly the nodes you want from a document. There's quite a good example-led tutorial on XPath on zvon.org.
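To give a feel for those more complicated specifications, here is a minimal sketch using a few predicates. The sample document is invented for illustration and parsed from a string rather than a file; the XPath features used (attribute tests, positions, count()) are standard XPath 1.0.

```perl
use strict;
use warnings;
use XML::LibXML;

# a small in-memory document (hypothetical sample data)
my $xml = <<'XML';
<html>
  <body>
    <p>First <img src="a.png" alt="A"/></p>
    <p>Second <img src="b.png"/> <img src="c.png" alt="C"/></p>
  </body>
</html>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# every <img> that has an alt attribute
my @with_alt = $doc->findnodes('//img[@alt]');

# the first <img> inside the second <p>
my ($first_in_second) = $doc->findnodes('//p[2]/img[1]');

# every <p> that contains more than one <img>
my @busy = $doc->findnodes('//p[count(img) > 1]');

print scalar(@with_alt), "\n";                      # prints 2
print $first_in_second->getAttribute('src'), "\n";  # prints b.png
print scalar(@busy), "\n";                          # prints 1
```

In each case the expression only describes what is wanted; the library works out how to walk the tree.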

As well as returning tag nodes, XPath can be used to return the value of things. For example, all the text inside the link to perl.com could be found like so:

  my $text =
    $doc->findvalue('//a[@href="http://www.perl.com/"]/text()');

The most useful thing about XPath is that you can call findnodes and findvalue on nodes that an earlier search returned, and the new search will run relative to the node you called it on.

  # get all the paragraphs
  foreach my $p_node ($doc->findnodes("//p"))
  {
    # print the text contained in the node
    print $p_node->findvalue("./text()");

    # underneath it print the images' urls
    foreach my $attr ($p_node->findnodes('./img/@src'))
    {
      # get the value of that attribute
      print "[IMG: " . $attr->findvalue(".") . "]\n";
    }

    # and some spacing
    print "\n\n";
  }

These XPath shenanigans are all great and wonderful, but they can get tiresome very quickly if all you want to do is a simple operation. XPath is good at the really complex stuff, but seems overly complicated if all you want to do is get a tag's parent tag. This is why XML::LibXML has a DOM-like interface that can be applied at any stage to any node.

  # for every image in a paragraph
  foreach my $img_node ($doc->findnodes("//p/img"))
  {
     # print the image url
     print $img_node->getAttribute("src") . ":\n";

     # print the text of the paragraph it's in
     print $img_node->parentNode
                    ->textContent;
  }

This is one of the great features of XML::LibXML - you can mix and match the different XML techniques using whichever suits the task at hand best. You can even get it to produce a stream of SAX events for you from the current document if you want.
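As a rough sketch of that mixing, the snippet below selects nodes with XPath, changes them through the DOM interface, and then runs another XPath query relative to a node reached by DOM navigation. The sample markup is made up for the purpose of the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# hypothetical sample markup, parsed from a string
my $doc = XML::LibXML->new->parse_string(
  '<body><p>Hello <img src="a.png"/></p>' .
  '<p>Bye <img src="b.png" alt="B"/></p></body>'
);

# XPath to select: every image with no alt attribute
foreach my $img ($doc->findnodes('//img[not(@alt)]'))
{
  # DOM to modify: give it an empty alt attribute
  $img->setAttribute('alt', '');

  # DOM to navigate, then XPath again relative to the parent
  my $p = $img->parentNode;
  print $p->nodeName . " contains " .
        $p->findvalue('count(img)') . " image(s)\n";
}

# serialise the modified tree
print $doc->documentElement->toString, "\n";
```

Neither interface has to be chosen up front; every node returned by one can be handed straight to the other.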

The Breadcrumb Example

A common form of navigation on websites is what is known as "Breadcrumb" navigation. This is where you have a section at the top of your page that looks somewhat like this:

  Home > Computers > Languages > Perl > Advent Calendar

where each element represents a subsection that can be clicked upon to move up a level to a more general area. We can use XML::LibXML to add this navigation to each of our webpages.

Firstly we need to edit each of our HTML files so that they contain the new tag <breadcrumb name="Advent Calendar" />. We can then write a script that runs through all the webpages it can find and replaces these tags with the actual breadcrumb navigation.

  #!/usr/bin/perl
  
  # turn on Perl's safety features
  use strict;
  use warnings;

  use File::Find::Rule;
  use Tie::File;

  # for every file we can find in the directory that was passed
  # in on the command line
  foreach my $file (File::Find::Rule->file
                                    ->name("*.html")
                                    ->in($ARGV[0]))
  {
    # tie that file to lines so that we can edit it
    tie my @lines, "Tie::File", $file
      or die "Can't tie '$file': $!";

    # search each line in turn.  If we knew our files were 
    # real xml we could have used XML::LibXML to do this, 
    # but for now let's just use a regular expression.
    foreach my $line (@lines)
    {
       # replace the tag with the results of the 
       # breadcrumb function (use 'e' on regex to
       # run perl code for replacement text)

       $line =~ s{ <           # start of the tag
                   \s*         # optional whitespace
                   breadcrumb  # the word breadcrumb
                   \s+         # whitespace
                   name        # name
                   \s*         # optional whitespace
                   =           # equals
                   \s*         # optional whitespace
                   "(.*?)"     # the contents of the attribute
                   \s*         # optional whitespace
                   /?>         # the end of the tag ('/' optional)
                 }{ 
                    # replace it with the results of the 
                    # breadcrumb function
                    breadcrumb($1)
                  }giex;
    }

    untie @lines;
  }
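Before moving on, the substitution can be tried in isolation on a single string. This is only a sketch: the breadcrumb function here is a hypothetical stand-in that returns a placeholder, so the 'e' modifier can be seen calling Perl code for the replacement text.

```perl
use strict;
use warnings;

# hypothetical stand-in for the real breadcrumb function
sub breadcrumb
{
  my $name = shift;
  return "<p>[crumbs for $name]</p>";
}

my $line = 'Intro <breadcrumb name="Advent Calendar" /> outro';

# same idea as the script: 'x' allows the spaced-out pattern,
# 'e' evaluates the replacement as Perl code
$line =~ s{ < \s* breadcrumb \s+ name \s* = \s* "(.*?)" \s* /?> }
          { breadcrumb($1) }giex;

print "$line\n";
# prints: Intro <p>[crumbs for Advent Calendar]</p> outro
```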

So, that's the script that finds the tags in the file. Now we need some kind of mapping between page names and page URLs. This is provided by the simple XML file below. Each page has a URL and a name, and may conceptually 'contain' other pages.

  <?xml version="1.0"?>

  <page name="Home" url="/">
   <page name="Computers" url="/comp">
    <page name="Languages" url="/comp/lang">
     <page name="Jako"   url="/jako"   />
     <page name="Scheme" url="/scheme" />
     <page name="Perl"   url="/perl">
      <page name="Advent Calendar" url="/perl/advent" />
      <page name="Modules List"    url="/perl/mods" />
     </page>
    </page>
   </page>
  </page>

(Note that this example would normally contain many more pages; I'm just being brief.)

And now we just need to write the breadcrumb function that reads in the file and spits out the correct HTML for navigation. Once it's parsed the XML file, all it really needs to do is look up the correct node with an XPath expression, and then move up the tree of nodes, creating the HTML for a navigation link (a 'crumb') as it goes, until it reaches the root node.

  my $doc;

  sub breadcrumb
  {
    my $nodename = shift;

    # parse the map if we haven't done this already
    unless ($doc)
    {
      use XML::LibXML;
      my $parser = XML::LibXML->new();
      $doc = $parser->parse_file("map.xml");
    }

    # find the node that we're interested in (the 'page' node
    # with the same name attribute)
    my ($node) = $doc->findnodes('//page[@name="'.$nodename.'"]')
     or die "Can't find page for name '$nodename'";
  
    # find the top node
    my ($root) = $doc->findnodes("/*");

    # the output string we're building up
    my $string = $nodename;
    
    # keep getting the parent node and making it a crumb 
    # while $node isn't the root node
    while (!$node->isSameNode($root))
    {
      # move up a node
      $node = $node->parentNode();

      # create the crumb and add it to the start of the string 
      use HTML::Entities;
      $string = '<a href="' . 
                encode_entities($node->getAttribute('url')) .
                '">' .
                encode_entities($node->getAttribute('name')) .
                '</a> &gt; ' .
                $string;
    }

    # return the string
    return "<p>$string</p>";
  }
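To check the walk up the tree on its own, here is a minimal sketch that collects just the page names, without any HTML. It assumes a cut-down copy of the map held in a string rather than in map.xml, and uses documentElement to get the top node; otherwise the loop is the same parentNode walk as in the function above.

```perl
use strict;
use warnings;
use XML::LibXML;

# a cut-down copy of the map, held in a string for this sketch
my $map = <<'XML';
<?xml version="1.0"?>
<page name="Home" url="/">
 <page name="Computers" url="/comp">
  <page name="Languages" url="/comp/lang">
   <page name="Perl" url="/perl">
    <page name="Advent Calendar" url="/perl/advent" />
   </page>
  </page>
 </page>
</page>
XML

my $doc  = XML::LibXML->new->parse_string($map);
my $root = $doc->documentElement;

# start at the page we're interested in
my ($node) = $doc->findnodes('//page[@name="Advent Calendar"]')
  or die "Can't find page";

# walk up to the root, collecting names as we go
my @crumbs = ($node->getAttribute('name'));
while (!$node->isSameNode($root))
{
  $node = $node->parentNode;
  unshift @crumbs, $node->getAttribute('name');
}

print join(' > ', @crumbs), "\n";
# prints: Home > Computers > Languages > Perl > Advent Calendar
```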

XML::LibXML - An XML::Parser Alternative article on perl.com

Tutorial on XPath

Template::Plugin::XML::LibXML (template toolkit plugin for XML::LibXML)

XML::XPath (XML::Parser based XPath module)

O'Reilly's Perl & XML book