It's a wide world out there, and the one hundred and twenty-eight characters in ASCII just don't cut it anymore. In a global marketplace - or whenever we want to talk to those Paris Perl Mongueurs - we need to use a bigger range of characters. The funny Es with acute signs on them, weird Greek characters, and things that just look like squiggles. We need them all.
Whenever we want to print these characters to a terminal or save them to a file we need to encode them in a character encoding so they can be represented as bytes. Whenever we read these characters in we need to decode the byte sequences.
Encode can do this for us.
Since Perl 5.6, perl has been able to store Unicode characters in strings. Consider the Unicode character Ω (omega), which has Unicode code point 937 (i.e. it's the 938th Unicode character, since we start counting from 0 not 1) and would typically be used like this in a mathematical formula:
a Ω b
This could simply be entered as a Perl string by using the chr function to convert from the code point to a character:
my $string = "a " . chr(937) . " b";
Or by using the \x escape sequence inside a string with the hexadecimal code for 937:
my $string = "a \x{03A9} b";
Or by using the \N escape sequence inside a string with the Unicode name for the character:
# load the character names into our script
use charnames ":full";
my $string = "a \N{GREEK CAPITAL LETTER OMEGA} b";
Or by using a Unicode-aware text editor together with the use utf8 pragma, and getting your editor to save the script using the utf8 byte sequence encoding, meaning you can just type the character with your keyboard.
# declare everything after this command will
# be represented on disk by utf8 bytes
use utf8;
my $string = "a Ω b";
All of these approaches work - Perl now has a five character string in memory that contains the correct character. For example, if we write something to print out the code point of each character we get the right thing:
foreach my $index (0..(length($string)-1)) {
    # get the character
    my $char = substr($string, $index, 1);

    # work out then print the code points
    my $codepoint = ord($char);
    print "$codepoint\n";
}
This prints:
97
32
937
32
98
for the a, the first space, the omega, the second space and the b.
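As a rough cross-check, the sketch below builds the string in three of the ways described above and confirms that they all hold the same five characters. The variable names ($from_chr, $from_hex, $from_name) are made up for this illustration, and the use utf8 variant is left out so that the script doesn't depend on how your editor saves it.

```perl
use strict;
use warnings;
use charnames ":full";

# three of the ways shown above of building the same string
# (the variable names are just for this illustration)
my $from_chr  = "a " . chr(937) . " b";
my $from_hex  = "a \x{03A9} b";
my $from_name = "a \N{GREEK CAPITAL LETTER OMEGA} b";

# split the string into characters and look up each code point
my @codepoints = map { ord($_) } split(//, $from_chr);

# only digits and spaces are printed here, so no encoding is needed yet
print "@codepoints\n";
print "all three strings are equal\n"
    if $from_chr eq $from_hex && $from_hex eq $from_name;
```

This should print the code points 97 32 937 32 98 on one line, followed by the message saying the strings are equal.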
The trouble comes when you want to print out the characters themselves. The question is of course "how do you send the character out to the terminal?" Printing the a out is trivial; just sending the byte 97 to the terminal will cause it to render a letter a on the screen. However, there isn't a single byte that always represents omega. It depends on the encoding that your terminal is using at the time. You need to know the correct byte (or bytes) to send to the terminal, in the encoding scheme that it's using, to get it to display the letter you want.
For example, if you set your terminal to use "iso-8859-7" then sending byte 217 to it will cause it to print an omega (whereas if you have it set to latin-1 as normal it'll just print a Ù.) If you have a utf8 terminal then you'll need to send it the multi-byte sequence of 206 and 169. The byte sequence you need is purely arbitrary - it's whatever is defined by the encoding you're using.
So how do you work out what to send?
The Encode module that ships with perl 5.8 can be used to encode strings that perl holds in memory into byte representations (and, in fact, go the other way and decode byte representations to make perl strings.) For example, converting our string into "iso-8859-7":
use Encode;
my $bytes = Encode::encode("iso-8859-7", $string);
The scalar $bytes now contains the bytes that represent $string in the encoding we passed. Printing out one byte per line like we did for the characters above gives us:
97
32
217
32
98
Most of the numbers are the same - because the byte that represents them in iso-8859-7 is the same as the Unicode character number. Only 937 has changed to 217. Printing this scalar to our iso-8859-7 terminal causes it to display the right characters.
a Ω b
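To see how much the bytes depend on the encoding you pick, here's a small sketch that encodes the same string as both iso-8859-7 and UTF-8 and prints the numeric value of each byte. The use of unpack with the "C*" template to list the bytes is our own choice for the illustration, not the only way to do it.

```perl
use strict;
use warnings;
use Encode;

my $string = "a " . chr(937) . " b";

# encode the same five characters in two different encodings
my $greek_bytes = Encode::encode("iso-8859-7", $string);
my $utf8_bytes  = Encode::encode("UTF-8", $string);

# unpack "C*" returns the numeric value of every byte in the scalar
my @greek = unpack("C*", $greek_bytes);
my @utf8  = unpack("C*", $utf8_bytes);

print "iso-8859-7: @greek\n";
print "UTF-8:      @utf8\n";
```

The iso-8859-7 line should show the single byte 217 for the omega, while the UTF-8 line should show the two bytes 206 and 169 in its place, making that version six bytes long rather than five.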
Whenever you print something out with Perl it has to work out which bytes to send for each character. By default (in latin-1 locales at least) it does no translation on the characters it's printing out, mapping each character's code point directly to the byte it prints (this also means that when you print scalars that contain binary data or already encoded byte sequences then thankfully no extra encoding happens.) A character such as omega, whose code point is above 255, can't be mapped to a single byte this way, so it has to be encoded first.
It's possible to tell perl to automatically translate the string into the correct format when you print it out. For example, to write a file as iso-8859-7 you can use a PerlIO layer to do the translation for you:
open my $fh, ">:encoding(iso-8859-7)", "file" or die "Can't write to file: $!";
print $fh $string;
Anything printed to $fh will automatically be encoded into iso-8859-7. binmode can be used on an already existing file handle:
binmode STDOUT, ":encoding(iso-8859-7)";
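If you want to convince yourself that the layer really is doing the encoding, one way is to write through it and then read the file back with no decoding at all. The sketch below does this with a scratch file from File::Temp; the temporary file and the ":raw" read are assumptions made for the demonstration rather than anything the approach above requires.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $string = "a " . chr(937) . " b";

# get a scratch file that is removed when the script exits
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# write the characters through the encoding layer
open my $out, ">:encoding(iso-8859-7)", $filename
    or die "Can't write to $filename: $!";
print $out $string;
close $out or die "Can't close $filename: $!";

# read the file back without any decoding to see the raw bytes
open my $in, "<:raw", $filename
    or die "Can't read from $filename: $!";
my $bytes = do { local $/; <$in> };
close $in;

# list the numeric value of each byte that ended up on disk
my @bytes = unpack("C*", $bytes);
print "@bytes\n";
```

The file should contain the same five bytes that Encode::encode produced earlier, with 217 standing in for the omega.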
Decoding a string of bytes works pretty much the same way as encoding one - just in reverse. We can read a file that's encoded in a particular encoding automatically:
open my $fh, "<:encoding(iso-8859-7)", "file" or die "Can't read from file: $!";
my $string = <$fh>;
Or, if we've already got a bunch of bytes in a string, we can decode it with Encode::decode:
use Encode;
my $string = Encode::decode("iso-8859-7", $bytes);
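As a quick sanity check on decoding, the following sketch starts from the five iso-8859-7 bytes we saw earlier, decodes them, and then encodes the result again to confirm that nothing was lost on the way. Building the bytes with pack and the $round_trip variable are just conveniences for this example.

```perl
use strict;
use warnings;
use Encode;

# the iso-8859-7 bytes for our string, built from their numeric values
my $bytes = pack("C*", 97, 32, 217, 32, 98);

# turn the bytes back into a string of characters
my $string = Encode::decode("iso-8859-7", $bytes);

# byte 217 should have become the character with code point 937
my @codepoints = map { ord($_) } split(//, $string);
print "@codepoints\n";

# encoding the decoded string again should give back the original bytes
my $round_trip = Encode::encode("iso-8859-7", $string);
print "round trip ok\n" if $round_trip eq $bytes;
```

This should print 97 32 937 32 98 followed by the round trip message, showing the omega has come back as a single character.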