XML::LibXML is an interface to the GNOME libxml2 XML parsing library, which is very, very fast indeed. It provides a large number of features, and it provides them well. It's fully buzzword compliant, supporting SAX, DOM, and XPath. Best of all, it allows you to mix these technologies freely, meaning that you can very quickly reach for the right tool for the job.
Oh, and if that wasn't enough, it's really, really fast. What more could you ask for from an XML parser?
I've already written about using XML::LibXML to parse HTML in the
XPath is commonly called "to XML what regular expressions are to strings." What this rather befuddling statement means is that XPath expressions, like regular expressions, are not instructions for how to find the data we're after, but a statement of the goal - a specification of the thing we're trying to find. For example:
    # create a new parser
    use XML::LibXML;
    my $parser = XML::LibXML->new();

    # parse the document
    my $doc = $parser->parse_file("myxml.xml");

    # get the <img> tags
    my @img = $doc->findnodes("//img");
Notice that the expression doesn't say how to go about finding the nodes in the document. The "//" simply means 'matching anywhere in the document' and "img" is the name of the tag - it's just a spec. XPath lets you write quite complicated expressions that pretty much allow you to select exactly the nodes you want from a document. There's quite a good
As well as returning tag nodes, XPath can be used to return the value of things. For example, all the text inside the link to perl.com could be found like so:
    # findvalue returns a single string, so store it in a scalar
    my $text = $doc->findvalue('//a[@href="http://www.perl.com/"]/text()');
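To make the difference between the two calls concrete, here's a minimal self-contained sketch. The inline document is made up for illustration, and it assumes a reasonably recent XML::LibXML that provides load_xml, so that no file on disk is needed.

```perl
use strict;
use warnings;
use XML::LibXML;

# a small made-up document, parsed from a string instead of a file
my $doc = XML::LibXML->load_xml(string => <<'XML');
<html>
  <body>
    <p>Visit <a href="http://www.perl.com/">perl.com</a> today.</p>
    <p><img src="camel.png"/><img src="onion.png"/></p>
  </body>
</html>
XML

# findnodes returns a list of node objects
my @img = $doc->findnodes('//img');
print scalar(@img), " images found\n";

# findvalue returns a plain string
my $text = $doc->findvalue('//a[@href="http://www.perl.com/"]/text()');
print "link text: $text\n";
```

The nodes in @img are objects you can keep working with, while $text is just a string.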
The most useful thing about XPath is that you can call findnodes and findvalue on nodes you've already been returned, and the search will start relative to the node you called it on.
    # get all the paragraphs
    foreach my $p_node ($doc->findnodes("//p")) {
        # print the text contained in the node
        print $p_node->findvalue("./text()");

        # underneath it print the images' urls
        foreach my $attr ($p_node->findnodes('./img/@src')) {
            # get the value of that attribute
            print "[IMG: " . $attr->findvalue(".") . "]\n";
        }

        # and some spacing
        print "\n\n";
    }
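The leading "./" is what makes a search relative. As a rough sketch (the document here is invented for the example), compare a relative search with one that starts with "//", which goes back to the document root no matter which node you call it on:

```perl
use strict;
use warnings;
use XML::LibXML;

# made-up document with two paragraphs and three images
my $doc = XML::LibXML->load_xml(string => <<'XML');
<body>
  <p>First<img src="a.png"/></p>
  <p>Second<img src="b.png"/><img src="c.png"/></p>
</body>
XML

my @summary;
foreach my $p_node ($doc->findnodes('//p')) {
    # './img/@src' is evaluated relative to this <p>, so each
    # paragraph only sees its own images
    my @srcs = map { $_->value } $p_node->findnodes('./img/@src');

    # '//img' starts from the document root even though we call
    # it on $p_node, so it finds every image in the document
    my @all = $p_node->findnodes('//img');

    push @summary, $p_node->findvalue('./text()') . ': '
        . join(',', @srcs) . ' (of ' . scalar(@all) . ' in the document)';
}
print "$_\n" for @summary;
```

Forgetting the leading dot is an easy mistake to make, and the result is a search that quietly ignores the node you started from.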
These XPath shenanigans are all great and wonderful, but they can get tiresome very quickly if all you want is a simple operation. XPath is good at the really complex stuff, but it seems overly complicated if all you want to do is get a tag's parent tag. This is why XML::LibXML has a DOM-like interface that can be applied at any stage to any node.
    # for every image in a paragraph
    foreach my $img_node ($doc->findnodes("//p/img")) {
        # print the image url
        print $img_node->getAttribute("src") . ":\n";

        # print the text of the paragraph it's in
        print $img_node->parentNode->textContent;
    }
This is one of the great features of XML::LibXML - you can mix and match the different XML techniques using whichever suits the task at hand best. You can even get it to produce a stream of SAX events for you from the current document if you want.
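As a small illustration of that mixing, here's a sketch that jumps to a node with XPath, looks around it with DOM calls, and then goes back to XPath from a node reached through the DOM. The markup is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# invented snippet of markup to poke at
my $doc = XML::LibXML->load_xml(string =>
    '<div id="intro"><p>Hello <img src="wave.png"/> world</p></div>');

# XPath to jump straight to the node we care about...
my ($img) = $doc->findnodes('//img');

# ...then DOM calls to look around it
my $p   = $img->parentNode;    # the enclosing <p>
my $div = $p->parentNode;      # and the <div> around that
print $p->nodeName, " inside ", $div->nodeName,
      " with id ", $div->getAttribute('id'), "\n";
print "paragraph text: ", $p->textContent, "\n";

# and back to XPath, relative to a node we reached via the DOM
my $count = $div->findvalue('count(.//img)');
print "images in the div: $count\n";
```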
A common form of navigation on websites is what is known as "Breadcrumb" navigation. This is where you have a section at the top of your page that looks somewhat like this:
Home > Computers > Languages > Perl > Advent Calendar
where each element represents a subsection that can be clicked upon to move up a level to a more general area. We can use XML::LibXML to add this navigation to each of our webpages.
Firstly we need to edit each of our HTML files so that they contain the new tag <breadcrumb name="Advent Calendar" />. We can then write a script that runs through all the webpages it can find and replaces these tags with the actual breadcrumb navigation.
    #!/usr/bin/perl

    # turn on Perl's safety features
    use strict;
    use warnings;

    use File::Find::Rule;
    use Tie::File;

    # for every file we can find in the directory that was passed
    # in on the command line
    foreach my $file (File::Find::Rule->file
                                      ->name("*.html")
                                      ->in($ARGV[0])) {
        # tie that file to lines so that we can edit it
        tie my @lines, "Tie::File", $file;

        # search each line in turn.  If we knew our files were
        # real xml we could have used XML::LibXML to do this,
        # but for now let's just use a regular expression.
        foreach my $line (@lines) {
            # replace the tag with the results of the
            # breadcrumb function (use 'e' on regex to
            # run perl code for replacement text)

            $line =~ s{
                <            # start of the tag
                \s*          # optional whitespace
                breadcrumb   # the word breadcrumb
                \s*          # optional whitespace
                name         # name
                \s*          # optional whitespace
                =            # equals
                \s*          # optional whitespace
                "(.*?)"      # the contents of the attribute
                \s*          # optional whitespace
                /?>          # the end of the tag ('/' optional)
            }{
                # replace it with the results of the
                # breadcrumb function
                breadcrumb($1)
            }giex
        }

        untie @lines;
    }
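If the 'e' flag is new to you, this cut-down sketch shows the substitution on a single string. The breadcrumb function here is only a stand-in that I've made up so the example runs by itself, and the regex is the same pattern squashed onto one line.

```perl
use strict;
use warnings;

# stand-in for the real breadcrumb function, purely for illustration
sub breadcrumb {
    my $name = shift;
    return "<p>[crumbs for $name]</p>";
}

my $line = 'Intro <breadcrumb name="Advent Calendar" /> outro';

# the 'e' flag makes Perl run the replacement as code and use
# whatever it returns as the replacement text
$line =~ s{<\s*breadcrumb\s*name\s*=\s*"(.*?)"\s*/?>}{ breadcrumb($1) }gie;

print "$line\n";
```

Without the 'e' flag the replacement would be treated as a plain string and the text "breadcrumb(Advent Calendar)" would end up in the page.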
So, that's the script that finds the tags in the file. Now we need some kind of mapping between page names and page urls. This is provided by the simple XML file below. Each page has a url and a name, and may conceptually 'contain' other pages.
    <?xml version="1.0"?>
    <page name="Home" url="/">
      <page name="Computers" url="/comp">
        <page name="Languages" url="/comp/lang">
          <page name="Jako" url="/jako" />
          <page name="Scheme" url="/sheme" />
          <page name="Perl" url="/perl">
            <page name="Advent Calendar" url="/perl/advent" />
            <page name="Modules List" url="/perl/mods" />
          </page>
        </page>
      </page>
    </page>
(Note that a real map would normally contain many more pages; I'm just being brief.)
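Before writing the real function it's worth poking at the map with a couple of XPath expressions. This is only a sketch: it uses a cut-down copy of the map, held in a string so the example stands alone.

```perl
use strict;
use warnings;
use XML::LibXML;

# cut-down copy of the map, held in a string for this example
my $map = XML::LibXML->load_xml(string => <<'XML');
<page name="Home" url="/">
  <page name="Computers" url="/comp">
    <page name="Languages" url="/comp/lang">
      <page name="Jako" url="/jako" />
      <page name="Perl" url="/perl" />
    </page>
  </page>
</page>
XML

# look up a page's url from its name
my $url = $map->findvalue('//page[@name="Languages"]/@url');
print "Languages lives at $url\n";

# list the pages that Languages directly contains
my @children = map { $_->getAttribute('name') }
               $map->findnodes('//page[@name="Languages"]/page');
print "It contains: @children\n";
```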
And now we just need to write the breadcrumb function that reads in the file and spits out the correct html for navigation. Once it's parsed the xml file all it really needs to do is look up the correct node with an XPath expression, and then move up the tree of nodes creating the html for a navigation link (a 'crumb') as we go until we reach the root node.
    use XML::LibXML;
    use HTML::Entities;

    my $doc;

    sub breadcrumb {
        my $nodename = shift;

        # parse the map if we haven't done this already
        unless ($doc) {
            my $parser = XML::LibXML->new();
            $doc = $parser->parse_file("map.xml");
        }

        # find the node that we're interested in (the 'page' node
        # with the same name attribute)
        my ($node) = $doc->findnodes('//page[@name="'.$nodename.'"]')
            or die "Can't find page for name '$nodename'";

        # find the top node
        my ($root) = $doc->findnodes("/*");

        # the output string we're building up
        my $string = $nodename;

        # keep getting the parent node and making it a crumb
        # while $node isn't the root node
        while (!$node->isSameNode($root)) {
            # move up a node
            $node = $node->parentNode();

            # create the crumb and add it to the start of the string
            $string = '<a href="'
                    . encode_entities($node->getAttribute('url'))
                    . '">'
                    . encode_entities($node->getAttribute('name'))
                    . '</a> &gt; '
                    . $string;
        }

        # return the string
        return "<p>$string</p>";
    }
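One thing to watch: the function builds its XPath expression by gluing the page name into a string, so a name containing a double quote would produce an invalid expression. Here's a hedged sketch of one way around that, which fetches every page node and compares the names in Perl instead. The find_page helper and the sample map are my own inventions for the example, not part of XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# made-up map containing a page name with awkward quotes in it
my $map = XML::LibXML->load_xml(string => <<'XML');
<page name="Home" url="/">
  <page name="Perl" url="/perl">
    <page name="Larry's &quot;Onion&quot; Talks" url="/perl/onion" />
  </page>
</page>
XML

# hypothetical helper: find a <page> by name without putting the
# name into the XPath string, so quotes in it can't break anything
sub find_page {
    my ($doc, $name) = @_;
    my ($node) = grep { $_->getAttribute('name') eq $name }
                 $doc->findnodes('//page');
    return $node;
}

my $node = find_page($map, q{Larry's "Onion" Talks})
    or die "Can't find page";
print $node->getAttribute('url'), "\n";

# walk up with parentNode, collecting names, until we step off
# the top of the element tree onto the document node
my @trail;
for (my $n = $node; $n && $n->isa('XML::LibXML::Element'); $n = $n->parentNode) {
    unshift @trail, $n->getAttribute('name');
}
print join(' > ', @trail), "\n";
```

This trades a little speed for safety, since every page node is fetched and checked in Perl. For a map of modest size that seems a reasonable price.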