The 2002 Perl Advent Calendar

On the 1st day of Advent my True Language brought to me..
URI::Find

URI::Find is a module that finds URLs contained within normal text. It's the module that can recognise the http://use.perl.org/ or (with the URI::Find::Schemeless extension) the www.perl.com URL contained in your plain text email for you. It can be used equally well for compiling lists of URLs or for altering the string to mark up the URLs - for example wrapping them in <a href="..."> ... </a> tags.

It's always tempting to use a simple regular expression to do this kind of thing yourself, but more often than not you'll miss a few cases (like implicit URLs starting with "ftp." for example) and before you know it what was a simple task will turn into an hour or two's work trying to fix one little problem after another. It's much better to use code that some other poor individual has had to slave over, and who has dealt with the countless bug reports, than to worry about the little things yourself.
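To make the pitfall concrete, here is a minimal sketch of the kind of hand-rolled pattern that tends to get written first. The `naive_find_urls` helper and the sample sentence are made up for illustration, and the pattern is not the one URI::Find uses internally.

```perl
use strict;
use warnings;

# A deliberately naive matcher (hypothetical helper): it only recognises
# URLs that spell out an http:// or https:// scheme.
sub naive_find_urls {
    my ($text) = @_;
    my @found = $text =~ m{(https?://\S+)}g;
    return @found;
}

# Sample text with one explicit URL and two implicit ones.
my $sample = "Read http://use.perl.org/ then try www.perl.com or ftp.cpan.org today.";
my @found  = naive_find_urls($sample);

print "Found: $_\n" for @found;
```

Only the URL with an explicit scheme is reported here. The "www." and "ftp." addresses slip through, and these are exactly the cases URI::Find::Schemeless is meant to handle.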

The simplest way to use URI::Find is to count URLs:

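  use URI::Find::Schemeless;
  # create a new finder object that we'll use to find urls;
  # new() expects a callback, so pass one that hands back the
  # matched text unchanged
  my $finder = URI::Find::Schemeless->new(sub { return $_[1] });
  # find urls in $text; find() returns the number of urls found
  my $matches = $finder->find(\$text);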

This is all very well and good, but knowing how many URLs are in a page isn't actually that useful. What we really want to do is wrap them in some links.

To do more complicated things than just count URLs, URI::Find uses what is known as a 'callback system'. Each separate URI::Find object you create can be set up with its own subroutine reference - its own "lump of code" - that Perl should call every time that URI::Find object finds a URL so that lump of code can do something useful with that URL.

This can look as simple as passing a reference to an existing subroutine like so:

  # call "found_url" whenever you find a url
  my $finder = URI::Find::Schemeless->new(\&found_url);
  sub found_url
  {
    ...
  }
  # search $text with that finder
  $finder->find(\$text);

Alternatively, you can use Perl's anonymous subroutine system to create a one-off subroutine there and then that does just what you want:

  # run this code whenever you find a url
  my $finder = URI::Find::Schemeless->new(sub { ... });
  # search $text with that finder
  $finder->find(\$text);

So, looking at this, let's create something that wraps all our URLs in <a href="..."> ... </a> tags. The subroutine called by URI::Find gets called with a URI::URL object (an object representing a URL) and the original text as its arguments, and it should return the text that replaces the existing text.

  my $finder = URI::Find::Schemeless->new( sub {
    my $uri    = shift;  # object representing the url
    my $string = shift;  # text that was in the url
    # return the replacement text, i.e. the same text
    # wrapped in <a href="..."> ... </a>
    return '<a href="'.
           $uri->abs.  # get the absolute address
           '">'.
           $string.    # keep the original text
           '</a>';
  });
  # and process the text through that finder
  $finder->find(\$text);

Note how we use the absolute address from the URI::URL object rather than just using $string. This is because if the text is "www.perlmonks.org" without a preceding "http://" then the browser would treat it as a relative URL, which is not what we want. So we ask the URL object for the full URL to include each time.

We have seen how we can replace the text with other text, allowing us to mark up URLs. But what if we want to create a list of URLs in the document or something similar? How would we go about collecting the URLs?

The answer is to use Perl's "closure" system, which in this case is nothing more complicated than knowing that the anonymous subroutine can see variables that we've created outside of the subroutine too.

  # a list of the urls
  my @urls;
  my $finder = URI::Find::Schemeless->new(sub {
    my $uri    = shift;  # object representing the url
    my $string = shift;  # text that was in the url
    # remember the $uri's address by adding it to @urls which
    # we can see from within this sub
    push @urls, $uri->abs;
    # return the original text back to leave the text
    # we were searching unaltered
    return $string;
  });
  # and process the text through that finder
  $finder->find(\$text);
  # and create a web page of those links (for example)
  print "<html><body>URLs were:<ul>";
  foreach my $url (@urls)
  {
    print qq{<li><a href="$url">$url</a></li>};
  }
  print "</ul></body></html>";
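If the closure part still feels like magic, the same mechanism can be sketched without URI::Find at all. The `each_word` helper below is a made-up stand-in for a finder that calls a callback once per word. The only point is that the anonymous subroutine can push onto `@seen` even though `@seen` was declared outside it.

```perl
use strict;
use warnings;

# Hypothetical helper standing in for a finder: it calls $callback once
# for every whitespace-separated word in $text and returns the word count.
sub each_word {
    my ($text, $callback) = @_;
    my $count = 0;
    for my $word (split ' ', $text) {
        $callback->($word);
        $count++;
    }
    return $count;
}

# @seen lives outside the anonymous sub, but the sub can still see it
# and add to it each time it is called.
my @seen;
my $count = each_word("see www.perl.com and www.perlmonks.org", sub {
    my $word = shift;
    push @seen, $word if $word =~ /^www\./;
});

print "Visited $count words, kept: @seen\n";
```

The URI::Find version above works the same way: the finder owns the loop, and the callback quietly fills in a variable that the rest of the program can read afterwards.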

  • URI::Find::Schemeless
  • URI::URL
  • RFC 2396