Perl 2004 Advent Calendar: Encode

On the 11th day of Advent my True Language brought to me..

Encode

It's a wide world out there, and the one hundred and twenty eight characters in ASCII just don't cut it anymore. In a global marketplace - or whenever we want to talk to those Paris Perl Mongueurs - we need to use a bigger range of characters. The funny Es with acute accents on them, weird Greek characters, and things that just look like squiggles. We need them all.

Whenever we want to print these characters to a terminal or save them to a file we need to encode them in a character encoding so they can be represented as bytes. Whenever we read these characters back in we need to decode the byte sequences.

Encode can do this for us.

[Read the documentation for Encode on search.cpan.org]

Since Perl 5.6, perl has been able to store Unicode characters in strings. Consider the Unicode character Ω (omega), which has the Unicode code point 937 (i.e. it's the 938th Unicode character, since we start counting from 0 not 1). It would typically be used like this in a mathematical formula:

  a Ω b

This could simply be entered as a Perl string by using the chr function to convert from the code point to a character.

 my $string = "a " . chr(937) . " b";

Or by using the \x escape sequence inside a string with the hexadecimal code for 937:

 my $string = "a \x{03A9} b";

Or by using the \N escape sequence inside a string with the Unicode name for the character:

 # load the character names into our script
 use charnames ":full";

 my $string = "a \N{GREEK CAPITAL LETTER OMEGA} b";

Or by using a Unicode aware text editor together with the use utf8 pragma. If you get your editor to save the script using the utf8 byte sequence encoding, you can just type the character with your keyboard.

 # declare that the source code after this line
 # is represented on disk by utf8 bytes
 use utf8;

 my $string = "a Ω b";
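As a quick sanity check, the first three spellings can be compared with each other. This is only a sketch: the variable names are made up for illustration, and the use utf8 variant is left out because its result depends on how the script file itself is saved.

```perl
use strict;
use warnings;
use charnames ":full";

# the same string written three different ways
my $from_chr  = "a " . chr(937) . " b";
my $from_hex  = "a \x{03A9} b";
my $from_name = "a \N{GREEK CAPITAL LETTER OMEGA} b";

# all three should be equal, five character strings
my $all_equal = ($from_chr eq $from_hex && $from_hex eq $from_name) ? 1 : 0;
print "same string: ", ($all_equal ? "yes" : "no"), "\n";
print "length: ", length($from_chr), "\n";
```

Whichever spelling you pick is a matter of taste; perl ends up holding the same characters in memory.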

Rendering the string

All of these approaches work - Perl now has a five character string in memory that contains the correct character. For example, if we write something to print out the code point of each character we get the right thing:

 foreach my $index (0..(length($string)-1))
 {
   # get the character
   my $char = substr($string, $index, 1);

   # work out then print the code points
   my $codepoint = ord($char);
   print "$codepoint\n";
 }

Prints

  97
  32
  937
  32
  98

For the a, the space, the omega, the second space and the b.
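The same list of code points can be collected more compactly with split and map. This is just an alternative sketch of the loop, not a different technique; the @codepoints array is a name introduced here for illustration.

```perl
use strict;
use warnings;

my $string = "a \x{03A9} b";

# split the string into characters, then take the code point of each
my @codepoints = map { ord($_) } split //, $string;

print join(", ", @codepoints), "\n";    # prints: 97, 32, 937, 32, 98
```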

The trouble comes when you want to print out the characters themselves. The question is of course "how do you send the character out to the terminal?" Printing the a is trivial: just sending the byte 97 to the terminal will cause it to render a letter a on the screen. However, there isn't a single byte that always represents omega. It depends on the encoding that your terminal is using at the time. You need to know the correct byte (or bytes) to send to the terminal in the encoding scheme that it's using to get it to display the letter you want.

For example, if you set your terminal to use "iso-8859-7" then sending byte 217 to it will cause it to print an omega (whereas if you have it set to latin-1 as normal it'll just print a Ù.) If you have a utf8 terminal then you'll need to send it the multi-byte sequence of 206 and 169. The byte sequence you're using is purely arbitrary - it's whatever is defined by the encoding you're using.
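To see where 206 and 169 come from, here is a rough sketch of the arithmetic utf8 uses for code points that need two bytes (0x80 to 0x7FF). It is only an illustration of the bit layout, written by hand for this one case; it is not how you'd normally produce utf8 in a real program.

```perl
use strict;
use warnings;

my $codepoint = 937;    # omega, 0x03A9

# this hand-rolled sketch only handles the two-byte range
die "not a two-byte code point\n"
  if $codepoint < 0x80 || $codepoint > 0x7FF;

# first byte is 110xxxxx and carries the top five bits
my $first = 0xC0 | ($codepoint >> 6);

# second byte is 10xxxxxx and carries the low six bits
my $second = 0x80 | ($codepoint & 0x3F);

print "$first $second\n";    # prints: 206 169
```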

So how do you work out what to send?

Using Encode to do the Character Translation

The Encode module that ships with perl 5.8 can be used to encode strings that perl holds in memory into byte representations (and, in fact, to go the other way and decode byte representations into perl strings.) For example, converting our string into "iso-8859-7":

  use Encode;
  my $bytes = Encode::encode("iso-8859-7", $string);

The scalar $bytes now contains the bytes that represent $string in the encoding we passed. Printing out one byte per line like we did for the characters above gives us:

  97
  32
  217
  32
  98

Most of the numbers are the same - because the byte that represents them in iso-8859-7 is the same as the Unicode character number. Only 937 has changed to 217. Printing this scalar to our iso-8859-7 terminal causes it to display the right characters.

  a Ω b
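To compare encodings side by side, the same string can be run through Encode::encode once per encoding and the resulting bytes listed with unpack. A minimal sketch; the %bytes_for hash is just a name introduced here to hold the results.

```perl
use strict;
use warnings;
use Encode;

my $string = "a \x{03A9} b";

# collect the byte values produced by each encoding
my %bytes_for;
foreach my $encoding ("iso-8859-7", "UTF-8") {
    my $bytes = Encode::encode($encoding, $string);

    # "C*" unpacks every byte as an unsigned number
    $bytes_for{$encoding} = join(" ", unpack("C*", $bytes));
    print "$encoding: $bytes_for{$encoding}\n";
}
```

This prints 97 32 217 32 98 for iso-8859-7 and 97 32 206 169 32 98 for UTF-8: the omega is the only character whose bytes differ between the two.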

Automatic Translation

Whenever you print something out with Perl it has to work out which bytes to send for each character. By default (in latin-1 locales at least) it does no translation on the characters it's printing out, mapping each character's code point directly to the byte it prints (which can only work for code points up to 255.) This also means that when you print scalars that contain binary data or already encoded byte sequences then thankfully no extra encoding happens.
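The default behaviour can be observed by printing a latin-1 character to an in-memory file handle and looking at what arrives. This sketch assumes perl is running with its default settings (no -C switch, no PERL_UNICODE environment variable and no open pragma), any of which would change the result.

```perl
use strict;
use warnings;

my $char = chr(233);    # e with an acute accent in latin-1

# print to a scalar instead of a real file so we can inspect the bytes
open my $fh, ">", \my $buffer
  or die "Can't open in-memory file: $!";
print $fh $char;
close $fh;

# with no encoding layer, code point 233 comes out as the single byte 233
my @bytes = unpack("C*", $buffer);
print "@bytes\n";
```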

It's possible to tell perl to automatically translate the string into the correct format when printing it out. For example, to write a file as iso-8859-7 you can use a PerlIO layer to do the translation for you:

  open my $fh, ">:encoding(iso-8859-7)", "file"
    or die "Can't write to file: $!";

  print $fh $string;

Anything printed to $fh will automatically be encoded into iso-8859-7. binmode can be used on an already existing file handle:

  binmode STDOUT, ":encoding(iso-8859-7)";
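One way to convince yourself the layer is doing its job is to write through it and then read the file back as raw bytes. The sketch below uses File::Temp for a scratch file, which is an assumption made for the example rather than anything the technique requires.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $string = "a \x{03A9} b";

# tempfile returns an open handle and a name; we only need the name
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# write the characters through the encoding layer
open my $out, ">:encoding(iso-8859-7)", $filename
  or die "Can't write to $filename: $!";
print $out $string;
close $out or die "Can't close $filename: $!";

# read the file back with no decoding at all
open my $in, "<:raw", $filename
  or die "Can't read from $filename: $!";
my $raw = do { local $/; <$in> };
close $in;

my @bytes = unpack("C*", $raw);
print "@bytes\n";    # prints: 97 32 217 32 98
```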

Decoding

Decoding a string of bytes works pretty much the same way as encoding one - just in reverse. We can read a file that's encoded in a particular encoding automatically:

  open my $fh, "<:encoding(iso-8859-7)", "file"
    or die "Can't read from file: $!";

  my $string = <$fh>;

Or, if we've already got a bunch of bytes in a string, we can decode it with Encode::decode:

  use Encode;
  my $string = Encode::decode("iso-8859-7", $bytes);
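Putting the two directions together, a round trip should hand back exactly the bytes we started with. A small sketch, building the iso-8859-7 bytes by hand with pack so it doesn't depend on any file; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode;

# the iso-8859-7 bytes for "a", space, omega, space, "b"
my $bytes = pack("C*", 97, 32, 217, 32, 98);

# bytes in, characters out
my $string     = Encode::decode("iso-8859-7", $bytes);
my @codepoints = map { ord($_) } split //, $string;
print "@codepoints\n";    # prints: 97 32 937 32 98

# characters in, bytes out again
my $round_trip = Encode::encode("iso-8859-7", $string);
print(($round_trip eq $bytes) ? "round trip ok\n" : "round trip failed\n");
```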

perlunicode

Encode::PerlIO