Perl 2004 Advent Calendar: Text::Unidecode

On the 12th day of Advent my True Language brought to me..

Text::Unidecode

For all its merits, sometimes we just don't want to deal with Unicode. We're often in a situation whereby we want a plain old ASCII version of what we're looking at. Léon gets annoyed when he has to use "Leon" instead, but everyone knows what it means.

What Text::Unidecode does is convert an arbitrary string into just ASCII characters. It's not perfect - it has to fudge a lot of stuff, but to be fair there's no one way to do this properly, since everyone disagrees on what maps to what. The fact is that it works, however, and that's what counts.

[Read the documentation for Text::Unidecode on search.cpan.org]

There's not really much that can be explained about this module that a simple example won't cover:

  use utf8;
  use Text::Unidecode;
  print unidecode("Léon & møøse\n");

Prints out

  Leon & moose

Note the é has been downgraded to an e and likewise each ø has been downgraded to an o. The & is already ASCII, so it passes through untouched. It can cope with quite complex stuff too. For example, throwing it a couple of Chinese characters works quite well:

  use utf8;
  use Text::Unidecode;
  print unidecode("\x{5317}\x{4EB0}\n");

prints

  Bei Jing

It replaces \x{5317} with Bei and \x{4EB0} with Jing. Of course, this is a rough approximation - the pronunciation of many characters changes quite a lot depending on dialect - or so I've been told.

It's worth remembering that Text::Unidecode's not perfect, however. For example, the ™ trademark sign doesn't get converted to TM:

  use utf8;
  use Text::Unidecode;
  print unidecode("Log™");

Sadly just prints

  Log

The ™ has been converted to a space. Still, it's better than nothing - and a lot better than having to worry about it all ourselves. At least we know for sure that whatever comes out the other end of the unidecode routine is going to be ASCII.
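
That last guarantee is easy enough to check for yourself. Here's a minimal sketch that only needs core Perl; is_ascii is a helper name I've made up for the purpose, not something Text::Unidecode provides:

```perl
use strict;
use warnings;

# is_ascii is a hypothetical helper (not part of Text::Unidecode):
# returns 1 if every character has a code point below 128, else 0
sub is_ascii
{
  my ($string) = @_;
  return $string !~ /[^\x00-\x7F]/ ? 1 : 0;
}

print is_ascii("Leon")      ? "ascii\n" : "not ascii\n";
print is_ascii("L\x{E9}on") ? "ascii\n" : "not ascii\n";
```

With Text::Unidecode loaded, is_ascii(unidecode($anything)) should come back true whatever you feed it.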

Using this in practice

My good friend was talking to me the other day about rearranging his mp3 collection. He wanted to put each of his mp3s in a different directory depending on who the artist was, and inside those directories he wanted a directory for each album, and finally within those directories he wanted the mp3 files, named after the track name.

For example, he's looking to create a directory structure that looks somewhat like this:

  mp3s
  |
  +-- Green_Day
  |   |
  |   +-- Dookie
  |   |   |
  |   |   +-- Burnout.mp3
  |   |   +-- Having_A_Blast.mp3
  |   |   +-- Chump.mp3
  |   |   +-- Long_View.mp3
  |   |
  |   +-- Insomniac
  |       |
  |       +-- Armatage_Shanks.mp3
  |       +-- Brat.mp3
  |       +-- Stuck_With_Me.mp3
  |
  +-- Miles_Davis
      |
      +-- Bags_Groove
          |
          +-- Bags_Groove.mp3
          +-- A_Gal_in_Calico.mp3
          +-- Minor_March.mp3

We can find all the existing files quite easily using File::Find::Rule.

  use File::Find::Rule;
  my @files = File::Find::Rule->file
                              ->name("*.mp3")
                              ->in("Music");
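
If File::Find::Rule isn't installed, roughly the same list can be built with the core File::Find module. This is a sketch rather than a drop-in replacement, and find_mp3s is a name I've made up for the purpose:

```perl
use strict;
use warnings;
use File::Find;

# find_mp3s is a hypothetical helper: returns the path of every
# plain file under $top whose name ends in .mp3 (case-sensitive)
sub find_mp3s
{
  my ($top) = @_;
  my @found;

  # inside the callback $_ is the bare filename and
  # $File::Find::name is the full path from $top
  find(sub {
    push @found, $File::Find::name if -f $_ and /\.mp3\z/;
  }, $top);

  @found = sort @found;
  return @found;
}

# only search if the directory is actually there
my @files = -d "Music" ? find_mp3s("Music") : ();
print "$_\n" foreach @files;
```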

Each MP3 file has an ID3 tag which contains the information we need (the artist, the title and the album). We can access this kind of information with MP3::Info.

  foreach my $filename (@files)
  {
    # find out where these things are meant to be
    use MP3::Info qw(:all);

    # get the values;
    use_mp3_utf8(1);
    my $tag = get_mp3tag($filename);

    # decode the utf8 bytes into chars
    use Encode qw(decode);
    foreach (values %{ $tag })
     { $_ = decode("utf8", $_) }

    # extract what we're interested in
    my $title  = $tag->{TITLE}  || "Unknown";
    my $artist = $tag->{ARTIST} || "Unknown";
    my $album  = $tag->{ALBUM}  || "Unknown";

    # work out where they're going
    use File::Spec::Functions; 
    use File::Path;
    use File::Copy;

    my $dir = catdir("mp3s", $artist, $album);
    mkpath([$dir]) unless -d $dir;
 
    # copy the mp3 there
    my $dest = catfile($dir, "$title.mp3");
    copy($filename, $dest)
      or warn "Couldn't copy $filename to $dest: $!";
  }

Of course, we're completely ignoring one huge problem - that the tags might contain strange Unicode characters (or ASCII ones that are illegal in filenames, for that matter). What we need to do is alter the script to munge the values as they come in:

  my $title  = munge($tag->{TITLE}  || "Unknown");
  my $artist = munge($tag->{ARTIST} || "Unknown");
  my $album  = munge($tag->{ALBUM}  || "Unknown");

And then write our munging routine:

  sub munge
  {
    local $_ = shift;

    # make ASCII
    use Text::Unidecode;
    $_ = unidecode($_);

    # convert some symbols into letters
    s/@/at/g;
    s/&/and/g;
    s/\*/star/g;

    # convert all nastiness to spaces (the - goes last so it's a literal)
    s/[^a-zA-Z0-9,.!\[\]'"|_()-]/ /g;

    # make all whitespace single spaces
    s/\s+/ /g;

    # remove starting or trailing whitespace
    s/^\s+//;
    s/\s+$//;

    # make all spaces underscores
    tr/ /_/;

    return $_;
  }
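
To see what the cleanup stages do to a few names, here's a sketch that runs them on their own. tidy_ascii is a hypothetical stand-in for munge that leaves out the unidecode call, so it needs nothing beyond core Perl and assumes its input is already ASCII:

```perl
use strict;
use warnings;

# tidy_ascii is a hypothetical stand-in for munge without the
# unidecode step; it assumes the string it's given is already ASCII
sub tidy_ascii
{
  local $_ = shift;

  # convert some symbols into letters
  s/@/at/g;
  s/&/and/g;
  s/\*/star/g;

  # convert all nastiness to spaces (the - goes last so it's a literal)
  s/[^a-zA-Z0-9,.!\[\]'"|_()-]/ /g;

  # collapse whitespace, trim both ends, then swap spaces for underscores
  s/\s+/ /g;
  s/^\s+//;
  s/\s+$//;
  tr/ /_/;

  return $_;
}

print tidy_ascii("  Having A Blast  "), "\n";    # Having_A_Blast
print tidy_ascii("AC/DC: Back In Black"), "\n";  # AC_DC_Back_In_Black
print tidy_ascii("Rock & Roll"), "\n";           # Rock_and_Roll
```

Note that the slash and the colon both end up as underscores, which is exactly what we want for something that's about to become part of a path.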

Huzzah! We're done. This, of course, assumes that our MP3s have valid ID3 tags. That, however, is a discussion for another day.

MP3::Info