Perl 2002 Advent Calendar: XML::SAX

On the 9th day of Advent my True Language brought to me...

XML::SAX

There are many ways of looking at XML. One way is to think of it as a big tree of nodes, each node holding other nodes that are either text, tag nodes (that in turn can hold other nodes, and so on and so on), or other things like comments. The trouble with treating the XML as a simple tree like this is that you have to hold the entire thing in memory at once in order to deal with it. This is simple when it's something like an XHTML web page, but somewhat more problematic when the data is something huge like an XML document with embedded MPEG-encoded movies.

The alternative is to think of the XML as a stream of nodes rather than a tree, where each time your code is called it gets told just one thing, that one item has been found - the start of a tag, a lump of text - and your code gets one chance to hand back modified XML. This is the SAX system. This whole approach is a lot simpler - and arguably a whole lot less powerful - but it does have one good thing going for it. Because of its simplicity it's really easy to produce very small chunks of code that can be easily plugged together to filter your XML.

When you start plugging together bits of code like this, complicated things become quite easy to do. You can build more and more complex tools out of parts that can be tested and evaluated on their own. In addition you can easily build tools that extend other people's tools - much like we did yesterday with the AxPoint example.

Like many ideas - XML itself among them - SAX works by limiting yourself in a simple way and defining a common standard, which allows cooperation between things that would previously have been hard to connect. SAX takes this idea and runs with it - to the stage that even non-XML data can be easily altered with a SAX pipeline by temporarily constraining it to the SAX API.

[Read the documentation for XML::SAX on search.cpan.org]

Consider the XML parsing process with a SAX parsing stream.

Your parser is like someone reading out the XML over the phone to you. "There's an opening body tag," it says, "then an opening p tag, then some text that says 'hello'..." You repeat what they say, and your friend sitting next to you with a bit of paper writes down what you say and ends up with an identical copy of the XML.

Now the clever bit is that you don't have to say exactly what you heard... you can change the tags you repeat and your friend will end up writing down an altered version of the XML. Or your friend, instead of writing it down, can relay the XML you said to the person he's on the phone with - but of course she can in turn change it too, so her scribe ends up with a version that both you and she changed, and so on and so on. What makes this really useful is that none of the people need to know what the others are up to - as long as they get the right XML in, they'll read out the right altered XML. So by swapping the order of who calls whom you can really quickly set up new ways of muddling with your data.

And of course the really shocking thing is that the people at the start and end don't even have to be reading from an existing XML document or writing out XML - as long as they keep that to themselves, all the people in the middle just think they're dealing with XML, when really the chain is taking in a description of a directory structure and turning it into a PDF presentation.

An Example

I knew when I started writing this that I would have so much to write here I couldn't possibly hope to code it all. That's why I sneakily snuck in a big example yesterday.

Let's start with the most basic XML::Filter, the 'do nothing' filter:

  package XML::Filter::DoNothing;

  # inherit all the standard event handlers from the base class
  use base qw(XML::SAX::Base);

  # turn on all the safety features
  use strict;
  use warnings;
 
  # return true
  1;

This is the most basic class. It inherits all methods from XML::SAX::Base, so it does nothing but pass events on unfiltered. In order to alter the XML, we need to override some events. Here's a simple class that skips nodes.

  package XML::Filter::SkipNodes;
  use base qw(XML::SAX::Base);

  use strict;
  use warnings;

  # when a '<foo>' is seen
  sub start_element
  {
     my $this = shift;
     my $tag = shift;

     # is it one we're skipping?
     return undef
        if $this->{skip} eq $tag->{LocalName};

     # call the super class to properly handle the event
     return $this->SUPER::start_element($tag);
  }

  # when a '</foo>' is seen
  sub end_element
  {
     my $this = shift;
     my $tag = shift;

     # is it one we're skipping?
     return undef
        if $this->{skip} eq $tag->{LocalName};

     # call the super class to properly handle the event
     return $this->SUPER::end_element($tag);
  }

1;

Okay, so what are the notable features of our example XML::Filter? Firstly, note that we didn't write a new method, yet the skip option, like all configuration options passed to the constructor, is automatically stored in the object hash by the inherited constructor.

It's also worth noting how events are handled. The start_element and end_element methods are called whenever the parser sees the start or end of a tag (or one straight after the other for tags like <br/>). The tag is passed in as a hash with various values, which are covered later. All events get a hash like this - but each event's hash contains different data depending on the type of event it is.

Events are passed on by calling the SUPER method of the same name as the one we're in, and our routine should return whatever that call returns. Calling the SUPER method is vitally important for chaining to work properly, since it is what hands the event to the next handler in the pipeline. It might still work in some situations if you return directly, but trust me on this one, sooner or later you'll get yourself into a bad situation.
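To make the chaining concrete, here is a minimal sketch that runs with core Perl only. ToyBase, ToySkip and ToyRecorder are hypothetical stand-ins invented for this illustration; ToyBase mimics only the one behaviour of XML::SAX::Base that matters here (forwarding an event to the object stored under Handler) and is not how the real module is implemented.

```perl
use strict;
use warnings;

# ToyBase: a hypothetical stand-in for XML::SAX::Base. It stores its
# constructor options in the object hash and forwards start_element to
# the next handler in the chain, if there is one.
package ToyBase;
sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub start_element {
    my ( $this, $tag ) = @_;
    my $next = $this->{Handler} or return;
    return $next->start_element($tag);
}

# ToySkip: drops one tag name and forwards everything else via SUPER.
package ToySkip;
our @ISA = ('ToyBase');
sub start_element {
    my ( $this, $tag ) = @_;
    return if $this->{skip} eq $tag->{LocalName};
    return $this->SUPER::start_element($tag);
}

# ToyRecorder: the end of the chain, it notes every tag that reaches it.
package ToyRecorder;
our @ISA = ('ToyBase');
sub start_element {
    my ( $this, $tag ) = @_;
    push @{ $this->{seen} }, $tag->{LocalName};
    return;
}

package main;

my $recorder = ToyRecorder->new( seen => [] );
my $chain    = ToySkip->new(
    skip    => 'blink',
    Handler => ToySkip->new( skip => 'marquee', Handler => $recorder ),
);

# feed four start tags in by hand, as a parser would
$chain->start_element( { LocalName => $_ } ) for qw(p blink em marquee);

print join( ',', @{ $recorder->{seen} } ), "\n";    # prints "p,em"
```

If ToySkip returned without calling SUPER::start_element, nothing would ever reach the recorder, which is the bad situation warned about above.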

An instance of our class can be created like this:

  use XML::Filter::SkipNodes;
  my $filter = XML::Filter::SkipNodes->new( skip => 'blink' );

And used in a pipeline like this:

  # load the classes
  use XML::SAX::Machines qw(:all);
  use XML::Filter::SkipNodes;

  # create a pipeline that filters out blinks and marquee and then
  # prints out the XML to the screen
  my $pipeline = Pipeline( 
    XML::Filter::SkipNodes->new( skip => 'blink' ),
    XML::Filter::SkipNodes->new( skip => 'marquee' ),
    \*STDOUT
  );

  # parse the file
  $pipeline->parse_uri("index.xhtml");

In the pipeline construct we can see the implementation of the multiple people on phones mentioned earlier. The first filter takes out the blink tags and then passes everything else on. The second takes out the marquee tags. Note that an XML parser is automatically created for us and we don't have to do anything special. This is one of the handy features of XML::SAX. Through some clever tricks, it can transparently use any XML parser that you subsequently install on the system, and it also ships with a pure-Perl parser that's pretty good, albeit a little slow, so it always has one at hand.

XML::SAX::Machines has a whole set of other 'Machines' that can be used to connect filters in arrangements other than simply one after the other, none of which I'll go over here. I find that the Pipeline is simple enough for most basic situations.

How are the tags made up?

The key to processing the tags is knowing which events are available and what they're passed. This is all documented in the XML::SAX::Base documentation, but the three events that you're probably most interested in will be start_element (when an opening XML tag is seen), end_element (when a closing XML tag is seen) and characters (when text is seen). You might also be interested in start_document and end_document, which mark the start and end of everything.

First, text nodes are the easiest thing to deal with. They are made up of hashes that just have the one key - Data.

  package XML::Filter::UppercaseBuffy;

  # inherit the standard event handlers so SUPER::characters exists
  use base qw(XML::SAX::Base);

  use strict;
  use warnings;

  sub characters
  {
    my $this  = shift;
    my $chars = shift;

    # make sure that every buffy is capitalised
    $chars->{Data} =~ s/buffy/Buffy/g;

    # pass it on to the parent class
    return $this->SUPER::characters($chars);
  }

1;

One subtle point is that parsers don't always send text through as continuous lumps, and may break sections of text up into two or more distinct blocks calling characters multiple times when you'd normally expect it to just call it once. To avoid that happening you can place an instance of XML::Filter::BufferText before your filter in the pipeline, which will bunch up all the events for you.
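A small core-Perl sketch of why this matters. The two event hashes below are made-up data standing in for one text node that a parser happened to deliver in two pieces, and fix_buffy is a hypothetical helper; the joining step only approximates what a buffering filter arranges for you and is not XML::Filter::BufferText's actual code.

```perl
use strict;
use warnings;

# hypothetical helper doing the same substitution as the filter above
sub fix_buffy {
    my ($text) = @_;
    $text =~ s/buffy/Buffy/g;
    return $text;
}

# one text node, delivered as two characters events (made-up data)
my @events = (
    { Data => 'I love bu' },
    { Data => 'ffy the vampire slayer' },
);

# filtering each chunk on its own misses the word split across the boundary
my $per_chunk = join '', map { fix_buffy( $_->{Data} ) } @events;

# joining the chunks first lets the substitution see the whole word
my $buffered = fix_buffy( join '', map { $_->{Data} } @events );

print "$per_chunk\n";    # prints "I love buffy the vampire slayer"
print "$buffered\n";     # prints "I love Buffy the vampire slayer"
```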

The start_element event has a much more complicated tag passed through to it. All the various parts are declared in the spec, but rather than forcing you to look through that, it's easier just to look at Data::Dumper style output from one of the nodes. For example:

 <p align="center">

hands the following structure to a start_element event handler as its second argument:

 {
   'Name'         => 'p',
   'LocalName'    => 'p',
   'Prefix'       => '',
   'NamespaceURI' => '',
   'Attributes'   => {
      '{}align'      => {
         'Name'         => 'align',
         'LocalName'    => 'align',
         'Prefix'       => '',
         'Value'        => 'center',
         'NamespaceURI' => ''
       }
     },
 };

As you can see, that's a pretty verbose structure. Because of this it's often easier to change an existing data structure you've been handed and return that rather than create a new one from scratch. You just have to be careful to be consistent - for example, if you change the Name of the tag you also have to change the LocalName.
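As a rough illustration of keeping the two in step, here is a core-Perl sketch. rename_element is a hypothetical helper written for this article, not something XML::SAX provides, and it assumes the element hash has the keys shown above.

```perl
use strict;
use warnings;

# rename_element: hypothetical helper, not part of XML::SAX.
# It changes LocalName and rebuilds Name from the prefix, so the two
# fields can never disagree.
sub rename_element {
    my ( $tag, $new_local ) = @_;
    my $prefix = defined $tag->{Prefix} ? $tag->{Prefix} : '';

    $tag->{LocalName} = $new_local;
    $tag->{Name}      = length($prefix) ? "${prefix}:${new_local}" : $new_local;
    return $tag;
}

# an element with no namespace prefix
my $plain = rename_element(
    { Name => 'p', LocalName => 'p', Prefix => '', NamespaceURI => '', Attributes => {} },
    'div',
);

# an element with a prefix: the prefix survives the rename
my $prefixed = rename_element(
    {
        Name         => 'perladvent:foo',
        LocalName    => 'foo',
        Prefix       => 'perladvent',
        NamespaceURI => 'http://perladvent.org/foo',
        Attributes   => {},
    },
    'quux',
);

print "$plain->{Name} $prefixed->{Name}\n";    # prints "div perladvent:quux"
```

In a filter you would call something like this inside both start_element and end_element before handing the hash to the SUPER method, so the opening and closing tags still match.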

You might be wondering what's with the {} in the attribute name. This is so-called James Clark notation, and it's to do with namespaces. Namespaces are a way of having multiple tags with the same name in the same document with different meanings.

The {} is where the namespace is inserted before the attribute name if there is one. Let's look at an example with namespaces:

  <perladvent:foo xmlns:perladvent="http://perladvent.org/foo" 
            perladvent:bar="bazz" />

This produces the following mammoth structure (with comments added by me):

 {
   # Name is the name, including the namespace prefix
   'Name'         => 'perladvent:foo',

   # LocalName is the bit after the namespace
   'LocalName'    => 'foo',

   # Prefix is the namespace prefix, the namespace's 'name'
   'Prefix'       => 'perladvent',

   # The URL that makes the namespace unique
   'NamespaceURI' => 'http://perladvent.org/foo',

   # the attributes, keyed by name
   'Attributes' => {

      # the declaration of the namespace
      '{http://www.w3.org/2000/xmlns/}perladvent' => {
         'Name'         => 'xmlns:perladvent',
         'LocalName'    => 'perladvent',
         'Prefix'       => 'xmlns',
         'Value'        => 'http://perladvent.org/foo',
         'NamespaceURI' => 'http://www.w3.org/2000/xmlns/'
      },

      # the 'bar' attribute
      '{http://perladvent.org/foo}bar' => {
         'Name'         => 'perladvent:bar',
         'LocalName'    => 'bar',
         'Prefix'       => 'perladvent',
         'Value'        => 'bazz',
         'NamespaceURI' => 'http://perladvent.org/foo'
      },
   },
 };

Yes, I know this is confusing, but rather than worrying about it too much you can make your life simple by remembering a simple rule:

If you're not using namespaces at all, then all you need to remember is to prefix attribute names with a {} when looking them up in the Attributes hash.
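Here is a short core-Perl sketch of that lookup. attribute_value is a hypothetical helper, not part of XML::SAX; the two element hashes are cut-down copies of the structures shown above.

```perl
use strict;
use warnings;

# attribute_value: hypothetical helper, not part of XML::SAX.
# It builds the "{NamespaceURI}LocalName" key and returns the attribute's
# Value, or undef if the element has no such attribute.
sub attribute_value {
    my ( $tag, $local_name, $namespace_uri ) = @_;
    $namespace_uri = '' unless defined $namespace_uri;

    my $key  = '{' . $namespace_uri . '}' . $local_name;
    my $attr = $tag->{Attributes}{$key};
    return $attr ? $attr->{Value} : undef;
}

# <p align="center">, no namespaces, so the key is '{}align'
my $p = {
    Name       => 'p',
    LocalName  => 'p',
    Attributes => {
        '{}align' => { Name => 'align', LocalName => 'align', Value => 'center' },
    },
};

# the perladvent:bar attribute from the namespaced example
my $foo = {
    Name       => 'perladvent:foo',
    LocalName  => 'foo',
    Attributes => {
        '{http://perladvent.org/foo}bar' =>
            { Name => 'perladvent:bar', LocalName => 'bar', Value => 'bazz' },
    },
};

my $align = attribute_value( $p,   'align' );
my $bar   = attribute_value( $foo, 'bar', 'http://perladvent.org/foo' );

print "$align $bar\n";    # prints "center bazz"
```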

Existing Modules

I'm not going to talk too much about the various extensions, but quite a few already exist. Try searching the CPAN for XML::Filter and XML::Generator modules.

XML::SAX::Base

XML::SAX::Machines

XML::SAX::Pipeline

Dom's London.pm lightning talk on SAX (pdf)

Understanding XML::SAX::Machines Part One on XML.com

Understanding XML::SAX::Machines Part Two on XML.com

Transforming XML With SAX Filters on XML.com

Writing SAX Drivers for Non-XML Data on XML.com