Perl 2003 Advent Calendar: HTML::Entities

On the 2nd day of Advent my True Language brought to me..

HTML::Entities

Yesterday's advent calendar talked about making sure the data your CGI accepts is checked carefully, and in a mirror of that, today's entry talks about making sure the data you output to an HTML page is formatted properly.

Just as failing to check the data you're getting in can cause you security problems, so can failing to check the data that you print out. It can open you up to cross site scripting attacks or, at the very least, cause your pixel perfect layout to break horribly when your variable doesn't quite contain what you thought it did when you printed it.

HTML::Entities is a simple module that can help you out here, translating strings back and forth between a normal string and one that's encoded in a form that's safe to place in the middle of an HTML document. Importantly, it's capable of handling the edge cases that a simple regular expression based solution doesn't deal with, ensuring that things you print out are safe.

[Read the documentation for HTML::Entities on search.cpan.org]

It's easy to get caught out when you're printing user data in HTML documents. Let's consider this basic example script that prints a form asking for your name, and once that form has been submitted back to the script prints out a hello greeting.

  #!/usr/bin/perl
  
  # turn on perl's safety features
  use strict;
  use warnings;

  # get the object
  use CGI;
  my $cgi = CGI->new();

  # print the header line to tell the browser we're using HTML
  print $cgi->header;

  # print the correct HTML depending on if a name has been
  # submitted or not.
  my $name = $cgi->param("name");
  if (!defined($name))
  { 
     print <<ENDOFHTML;
  <html>
   <body>
    <form method="POST">
     what is your name? <input type="text" name="name" /><br />
     <input type="submit" name="submit" value="Submit" />
    </form>
   </body>
  </html>
  ENDOFHTML
  }
  else
   { print "<html><body>Hello $name</body></html>"; }

The scary thing is that this (broken) example will work fine, most of the time. Most input people enter into a browser is valid HTML. Things like "Mark" will just work. It'll spit out this:

  Content-Type: text/html;

  <html><body>Hello Mark</body></html>

Which will render in the browser as:

  Hello Mark

The trouble comes when someone decides to put something odd into the form and we print it out without processing it, as we do in the code example above. As most people who've written HTML know, this can be something that we'd normally consider quite innocuous, like an ampersand & or a left angle bracket <. Left angle brackets are used to start tags, and ampersands are used to create 'entities'. Entities are used for things that can't be represented in the character set that the HTML is using, like &eacute; for a lowercase e with an acute accent. Entities are also used for encoding the special characters themselves, for when you want to use something like < or & without it having any special meaning. In day to day HTML output this means changing all the ampersands to &amp; and left angle brackets to &lt; in your string.

For example, if someone places Rod, Jane & Freddy into the form, then we should encode the entities in that so it prints out this:

  Content-Type: text/html;

  <html><body>Hello Rod, Jane &amp; Freddy</body></html>

Whereas our previous example will only print out:

  Content-Type: text/html;

  <html><body>Hello Rod, Jane & Freddy</body></html>

Worse, someone could put <img src="http://python.org/pics/pythonHi.gif"> directly into the form, meaning that the script would print out without a second thought:

  Content-Type: text/html;

  <html><body>Hello <img src="http://python.org/pics/pythonHi.gif"></body></html>

And the browser would go ahead and load an image where you'd previously just wanted to be able to show a simple name.
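It's tempting to handle this with a couple of substitutions of your own. As a rough sketch of why that's easy to get wrong, the two subroutines below (naive_encode and ordered_encode are made-up names for illustration, not part of any module) differ only in the order of their substitutions. The first encodes the left angle bracket and then encodes the ampersand it has just created.

```perl
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

# hypothetical hand-rolled encoder with the substitutions in the wrong order
sub naive_encode {
    my ($text) = @_;
    $text =~ s/</&lt;/g;     # this introduces a new ampersand...
    $text =~ s/&/&amp;/g;    # ...which then gets encoded a second time
    return $text;
}

# the same idea with the ampersand dealt with first
sub ordered_encode {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    return $text;
}

my $input   = "1 < 2 & 3";
my $naive   = naive_encode($input);
my $ordered = ordered_encode($input);

print "naive:   $naive\n";      # naive:   1 &amp;lt; 2 &amp; 3
print "ordered: $ordered\n";    # ordered: 1 &lt; 2 &amp; 3
```

Even the second version is only a sketch. It still ignores right angle brackets, quotes and high-bit characters, and that's the kind of detail a module can take care of for you.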

Using HTML::Entities

HTML::Entities is one of the easiest Perl modules to use. It simply exports two functions into the caller's namespace. These can be used to encode or decode strings to or from a form that can be printed directly into an HTML document.

  use HTML::Entities;

  # define my string
  my $string = "Rod, Jane & Freddy";
  
  # encode it
  encode_entities($string);

  # print it
  print "$string\n";

This prints

  Rod, Jane &amp; Freddy

Note how the encode_entities function, when its return value is ignored as it is here, encodes the string in place, altering the value it was passed. If you don't want to alter the existing value, you can simply assign the result of encode_entities to a new variable, which will leave the original value untouched.

  use HTML::Entities;

  # define my string
  my $string = "Rod, Jane & Freddy";
  
  # encode it
  my $newstring = encode_entities($string);

  # print it
  print "old: $string\n" .
        "new: $newstring\n";

This prints:

  old: Rod, Jane & Freddy
  new: Rod, Jane &amp; Freddy

You also have control over which characters you actually want encoded. By default the characters encoded are control chars, high-bit chars, and the <, &, >, and " characters, but if you want to encode other characters then you might want to alter how you encode things. For example, in the version described here the ' character is not encoded by default (later releases are documented as adding it to the default set, so check the documentation for the version you have installed), but if you're creating XML and using it to delimit your attribute values you might want to encode any apostrophe that you're placing in an attribute. You can instruct HTML::Entities what to encode by placing the characters to be escaped in a string which you pass as a second argument to encode_entities like so:

  use HTML::Entities;

  # encode only the apostrophe
  my $string = encode_entities("Mark's House","'");
  print "<meet where='$string' />";

This prints:

  <meet where='Mark&#39;s House' />

Note that only the characters listed in that second argument are encoded, so things like < would not be encoded in the above example.
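If you want the apostrophe encoded as well as the usual suspects, one option is to list them all in the second argument. This is a small sketch rather than a recommendation, and the cartoon element and its title attribute are made up for the example:

```perl
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

use HTML::Entities;

# a value containing an ampersand, an apostrophe and double quotes
my $title = q{Tom & Jerry's "show"};

# encode exactly the characters listed in the second argument
my $encoded = encode_entities($title, q{<>&"'});

print "<cartoon title='$encoded' />\n";
```

This prints:

```
<cartoon title='Tom &amp; Jerry&#39;s &quot;show&quot;' />
```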

The decode_entities function works essentially the same way as the encode_entities function but in reverse.

  use HTML::Entities;

  # define my string
  my $string = "Rod, Jane &amp; Freddy";
  
  # decode it
  my $newstring = decode_entities($string);

  # print it
  print "old: $string\n" .
        "new: $newstring\n";
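The example above leaves $string as Rod, Jane &amp; Freddy and puts Rod, Jane & Freddy into $newstring. As a further sketch, decode_entities copes with named, decimal and hexadecimal entities alike, and decoding an encoded string should hand you back what you started with. The address in the example is invented, and the script compares strings rather than printing the accented character, to sidestep any output encoding questions.

```perl
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

use HTML::Entities;

# named, decimal and hexadecimal forms of the same character
my $named   = decode_entities("caf&eacute;");
my $decimal = decode_entities("caf&#233;");
my $hex     = decode_entities("caf&#xE9;");

print "all three forms decode to the same string\n"
  if $named eq $decimal && $decimal eq $hex;

# encode then decode, and check we get the original back
my $original  = q{Rod, Jane & Freddy <rainbow@example.com>};
my $roundtrip = decode_entities(encode_entities($original));

print "round trip ok\n" if $roundtrip eq $original;
```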

The Potential Security Concerns

In a lot of cases, if you break your HTML by not encoding an ampersand or left angle bracket, modern browsers will cope, or at worst you won't be able to display your web page properly. It's a bad thing to do, but it's not the end of the world. What you've really got to worry about is people actually inserting extra HTML into your document.

On a minimal level this can be used by people to add extra markup that we might not want them to be able to add, for example making the text they've submitted as a comment on a journal entry bigger, or a different colour. Or worse, they could link in inappropriate pictures. All these things can range from making your site look very ugly, to meaning your site might seem to be displaying copyrighted or indecent pictures.

While these matters are pretty bad, they pale in comparison to the risk of having someone place some Javascript in your page. The most basic attack someone could mount is to place some code that redirects anyone looking at your page to another page on the web:

  <script>document.location="http://perladvent.org"</script>

This will cause the browser to stop displaying your page and immediately load up this advent calendar.
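For comparison, here's a rough sketch of what that same string looks like once it's been through encode_entities with its default settings. The $comment variable and the surrounding paragraph tags are just stand-ins for wherever your own script prints user data.

```perl
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

use HTML::Entities;

# the hostile input from above, as it might arrive from a form
my $comment = '<script>document.location="http://perladvent.org"</script>';

# encode it before it goes anywhere near the page
my $safe = encode_entities($comment);

print "<p>$safe</p>\n";
```

This prints:

```
<p>&lt;script&gt;document.location=&quot;http://perladvent.org&quot;&lt;/script&gt;</p>
```

The browser now shows the text of the script to the reader rather than running it.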

Even worse still is the problem of cross site scripting. Javascript can access the cookies that are stored for your site, and these can contain private information like your session details. If a web based email client doesn't encode its HTML properly, it might be possible to send an email containing some Javascript that's executed when the email is read. One thing this Javascript could do is steal the session info from the site cookie and send it to the attacker by encoding it in a URL for an image that the Javascript requests from the attacker's webserver. With this session info the attacker could read the person's email.

This so-called Cross Site Scripting is one of the growing number of attacks that can be used against sloppily coded sites.

Preventing Cross Site Scripting