Yesterday's advent calendar talked about making sure the data your CGI accepts is checked carefully, and in a mirror of that, today's entry talks about making sure the data you output to a HTML page is formatted properly.
Just as failing to check the data you're getting in can cause you security problems, so can failing to check the data that you print out. It can open you up to cross site scripting attacks or, at the very least, cause your pixel-perfect layout to break horribly when your variable doesn't quite contain what you thought it did when you printed it.
HTML::Entities is a simple module that can help you out here, translating strings back and forth between a normal string and one that's encoded in a form that's safe to place in the middle of a HTML document. Importantly, it's capable of handling the edge cases that a simple regular expression based solution doesn't deal with, ensuring that things you print out are safe.
It's easy to get caught out when you're printing user data in HTML documents. Let's consider this basic example script that prints a form asking for your name, and once that form has been submitted back to the script prints out a hello greeting.
#!/usr/bin/perl

# turn on perl's safety features
use strict;
use warnings;

# get the object
use CGI;
my $cgi = CGI->new();

# print the header line to tell the browser we're using HTML
print $cgi->header;

# print the correct HTML depending on if a name has been
# submitted or not.
my $name = $cgi->param("name");
if (!defined($name)) {
  print <<ENDOFHTML;
<html>
 <body>
  <form method="POST">
   what is your name? <input type="text" name="name" /><br />
   <input type="submit" name="submit" value="Submit" />
  </form>
 </body>
</html>
ENDOFHTML
} else {
  print "<html><body>Hello $name</body></html>";
}
The scary thing is that this (broken) example will work fine most of the time, because most of what people type into a form contains nothing that HTML treats specially. Things like "Mark" will just work. It'll spit out this:
Content-Type: text/html;
<html><body>Hello Mark</body></html>
Which will render in the browser as:
Hello Mark
The trouble comes when someone decides to put something odd into the
form and we print it out without processing it, as we do in the code
example above. As most people who've written HTML know, this can be
something that we'd normally consider quite innocuous, like an ampersand
& or a left angle bracket <. Left angle brackets are used to
start tags, and ampersands are used to create 'entities'. Entities
are used for things that can't be represented in the character set
that the HTML is using, like &eacute; for a lowercase e with an acute
accent. Entities are also used for encoding the special characters
themselves, for where you want to use something like < or &
without it having any special meaning. In day to day HTML output this
means changing all the ampersands to &amp; and left angle brackets to
&lt; in your string.
For example, if someone places Rod, Jane & Freddy into the form,
then we should encode the entities in that so it prints out this:
Content-Type: text/html;
<html><body>Hello Rod, Jane &amp; Freddy</body></html>
Whereas our previous example will only print out:
Content-Type: text/html;
<html><body>Hello Rod, Jane & Freddy</body></html>
Worse, someone could put <img src="http://python.org/pics/pythonHi.gif">
directly into the form, meaning that the script would print out
without a second thought:
Content-Type: text/html;
<html><body>Hello <img src="http://python.org/pics/pythonHi.gif"></body></html>
And the browser would go ahead and load an image where you'd previously just wanted to be able to show a simple name.
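The obvious first attempt at a fix is a couple of hand-written substitutions, and it is easy to get even that wrong. The sketch below is purely illustrative: naive_escape is a made-up helper, not something from any module, and it shows one ordering mistake rather than every edge case.

```perl
use strict;
use warnings;

# Hypothetical hand-rolled escaper (an illustration, not a real API).
# It replaces the angle bracket first and the ampersand second.
sub naive_escape {
    my ($text) = @_;
    $text =~ s/</&lt;/g;     # "<" becomes "&lt;" ...
    $text =~ s/&/&amp;/g;    # ... and then that new "&" gets escaped again
    return $text;
}

print naive_escape("1 < 2"), "\n";   # prints "1 &amp;lt; 2"
```

A browser would display that as the literal text "1 &lt; 2" rather than "1 < 2". Swapping the two substitutions cures this particular bug, but it is the kind of detail that is better left to a module.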
HTML::Entities is one of the easiest Perl modules to use. It simply exports two functions into the caller's namespace. These can be used to encode or decode strings to or from a form that can be printed directly into a HTML document.
use HTML::Entities;

# define my string
my $string = "Rod, Jane & Freddy";

# encode it
encode_entities($string);

# print it
print "$string\n";
This prints:
Rod, Jane &amp; Freddy
Note how the encode_entities function encodes the string in place,
altering the value it was passed. It does this because it was called in
void context, with its return value ignored. If you don't want to alter
the existing value, you can simply assign the result of encode_entities
to a new variable, which will leave the original value untouched.
use HTML::Entities;

# define my string
my $string = "Rod, Jane & Freddy";

# encode it
my $newstring = encode_entities($string);

# print it
print "old: $string\n" .
      "new: $newstring\n";
This prints:
old: Rod, Jane & Freddy
new: Rod, Jane &amp; Freddy
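Putting this to work on the greeting script from earlier might look something like the sketch below. The greeting_page helper is a hypothetical name introduced here so the example can run without a web server; in the real script the name would come from $cgi->param("name").

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical helper: build the greeting page for a name that has
# already been fetched, e.g. with $cgi->param("name").
sub greeting_page {
    my ($name) = @_;

    # assigning the result leaves $name itself untouched
    my $safe_name = encode_entities($name);

    return "<html><body>Hello $safe_name</body></html>";
}

print greeting_page("Rod, Jane & Freddy"), "\n";
print greeting_page('<img src="http://python.org/pics/pythonHi.gif">'), "\n";
```

The first call prints the ampersand as &amp;, and the second prints the would-be image tag with its angle brackets and quotes encoded, so the browser shows it as text instead of loading the picture.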
You also have control over which entities you actually want encoded.
By default the characters encoded are control chars, high-bit chars, and
the <, &, >, and " characters, but if you want to
encode other characters then you might want to alter how you encode
things. For example, the ' character is not encoded by default
(recent releases of HTML::Entities do include it in the default set, so
check the documentation for the version you have installed),
but if you're creating XML and using it to delimit your attribute
values you might want to encode any apostrophe that you're placing in
an attribute. You can instruct HTML::Entities what to encode by
placing the characters to be escaped in a string which you pass as a
second argument to encode_entities like so:
use HTML::Entities;
my $string = encode_entities("Mark's House", "'");
print "<meet where='$string' />";
This prints:
<meet where='Mark&#39;s House' />
Note that only the characters contained in the string are encoded, so
things like < would not be encoded in the above example.
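Because the second argument replaces the default set rather than adding to it, a value headed for an attribute usually wants the markup characters listed alongside the quotes. Here is a minimal sketch of that, where attribute_safe is a hypothetical helper name and the choice of characters is an assumption about what a quoted attribute value needs:

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical helper: encode the markup characters plus both kinds
# of quote, and nothing else.
sub attribute_safe {
    my ($value) = @_;
    return encode_entities($value, q{<>&"'});
}

my $where = attribute_safe(q{Mark's <House> & "Garden"});
print "<meet where='$where' />\n";
```

None of the five listed characters survives in $where, so the value cannot close the attribute or open a new tag, and decode_entities turns it back into the original string.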
The decode_entities function works essentially the same way as the encode_entities function but in reverse.
use HTML::Entities;

# define my string
my $string = "Rod, Jane &amp; Freddy";

# decode it
my $newstring = decode_entities($string);

# print it
print "old: $string\n" .
      "new: $newstring\n";
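One consequence of decoding worth knowing about is that it throws away how an entity was originally spelled. The following sketch, using made-up sample strings, decodes a named and a numeric entity for the same character and then encodes the result again:

```perl
use strict;
use warnings;
use HTML::Entities;

# a named entity and a numeric entity for the same character
my $named   = decode_entities("caf&eacute;");
my $numeric = decode_entities("caf&#233;");

# both decode to the same string ...
print "same\n" if $named eq $numeric;

# ... so encoding again cannot tell which spelling was used originally
my $reencoded = encode_entities($numeric);
print "$reencoded\n";    # prints "caf&eacute;"
```

So encoding and decoding are inverses as far as the text itself goes, but a decode followed by an encode will not necessarily reproduce the markup you started with byte for byte.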
In a lot of cases if you break your HTML by not encoding an ampersand or left angle bracket modern browsers will cope, or at the worst you won't be able to display your web page properly. It's a bad thing to do, but it's not the end of the world. What you've really got to worry about is people actually inserting extra HTML into your document.
On a minimal level this can be used by people to add extra markup that we might not want them to be able to add, for example making the text they've submitted as a comment on a journal entry bigger, or a different colour. Or worse, they could link in inappropriate pictures. All these things can range from making your site look very ugly, to meaning your site might seem to be displaying copyrighted or indecent pictures.
While these matters are pretty bad, they pale in comparison to the risk of having someone place some Javascript in your page. The most basic attack someone could mount is to place some code that redirects anyone looking at your page to another page on the web:
<script>document.location="http://perladvent.org"</script>
This will cause the browser to stop displaying your page and immediately load up this advent calendar.
Even worse still is the problem of cross site scripting. Javascript can access the cookies that are stored for your site, and these can contain private information like your session details. If someone sends an email which is subsequently read via a web based email client which doesn't encode its HTML properly, it might be possible to send an email with some Javascript in it that's executed when the email is read. One thing this Javascript could do is steal the session info from the site cookie and send it to the attacker by encoding it in a URL for an image that the Javascript requests from the attacker's webserver. With this session info the attacker could read the person's email.
This so-called Cross Site Scripting is one of the growing number of attacks that can be used against sloppily coded sites.
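To tie this back to the module, here is a rough sketch of what happens to the redirect payload from above once it has been through encode_entities. The render_comment helper and the surrounding paragraph tag are assumptions made up for the illustration, standing in for whatever code prints user-submitted text on your site.

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical comment text submitted by an attacker
my $comment = '<script>document.location="http://perladvent.org"</script>';

# Hypothetical helper that wraps user-submitted text in a paragraph
sub render_comment {
    my ($text) = @_;
    return "<p>" . encode_entities($text) . "</p>";
}

print render_comment($comment), "\n";
```

The output no longer contains a script tag at all, only &lt;script&gt; and friends, so the browser displays the attacker's code as plain text instead of running it. This covers text printed into the body of a page; other contexts, such as URLs, need their own kind of escaping.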