For all its merits, Unicode is sometimes something we just don't want to deal with. We're often in a situation where we want a plain old ASCII version of what we're looking at. Léon gets annoyed when he has to use "Leon" instead, but everyone knows what it means.
What Text::Unidecode does is convert an arbitrary string into just ASCII characters. It's not perfect - it has to fudge a lot of stuff - but to be fair there's no one way to do this properly, since everyone disagrees on what maps to what. The fact is that it works, however, and that's what counts.
There's not really much that can be explained about this module that a simple example won't cover:
    use utf8;
    use Text::Unidecode;
    print unidecode("Léon & møøse\n");
Prints out
    Leon & moose
Note that the é has been downgraded to an e, and likewise the ø characters have each been downgraded to an o. It can cope with quite complex stuff. For example, throwing it a couple of Chinese characters works quite well:
    use utf8;
    use Text::Unidecode;
    print unidecode("\x{5317}\x{4EB0}\n");
prints
Bei Jing
It replaces \x{5317} with Bei and \x{4EB0} with Jing. Of course, this is a rough approximation - the pronunciation of many characters changes quite a lot depending on dialect - or so I've been told.
It's worth remembering that Text::Unidecode's not perfect, however. For example, the ™ trademark sign doesn't get converted to TM.
    use utf8;
    use Text::Unidecode;
    print unidecode("Log™");
Sadly just prints
Log
The ™ has been converted to a space. Still, it's better than nothing - and a lot better than having to worry about it all ourselves. At least we know for sure that whatever comes out the other end of the unidecode routine is going to be in ASCII.
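That guarantee is easy enough to check for ourselves. What follows is only a minimal sketch, not anything Text::Unidecode provides: the is_ascii helper is a name made up for this example, and all it does is look for any character above the 7-bit range.

```perl
use strict;
use warnings;
use utf8;                  # this source contains literal non-ASCII characters
use Text::Unidecode;

# Hypothetical helper: true if the string holds nothing above code point 0x7F.
sub is_ascii {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/;
}

# Run a few sample strings through unidecode and check what comes back.
foreach my $input ("Léon", "møøse", "\x{5317}\x{4EB0}") {
    my $output = unidecode($input);
    print "$output: ", (is_ascii($output) ? "ASCII" : "not ASCII"), "\n";
}
```

If a check like this ever reported "not ASCII", that would be the cue to look at the input more closely rather than trust the result as a filename.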
My good friend was talking to me the other day about rearranging his mp3 collection. He wanted to put each of his mp3s in a different directory depending on who the artist was, and inside those directories he wanted a directory for each album, and finally within those directories he wanted the mp3 files, named after the track name.
For example, he's looking to create a directory structure that looks somewhat like this:
    mp3s
    |
    +-- Green_Day
    |   |
    |   +-- Dookie
    |   |   |
    |   |   +-- Burnout.mp3
    |   |   +-- Having_A_Blast.mp3
    |   |   +-- Chump.mp3
    |   |   +-- Long_View.mp3
    |   |
    |   +-- Insomniac
    |       |
    |       +-- Armatage_Shanks.mp3
    |       +-- Brat.mp3
    |       +-- Stuck_With_Me.mp3
    |
    +-- Miles_Davis
        |
        +-- Bags_Groove
            |
            +-- Bags_Groove.mp3
            +-- A_Gal_in_Calico.mp3
            +-- Minor_March.mp3
We can find all the existing files quite easily using File::Find::Rule.
    use File::Find::Rule;
    my @files = File::Find::Rule->file
                                ->name("*.mp3")
                                ->in("Music");
Each MP3 file has an MP3 tag which contains the information we need (the artist, the title and the album). We can access this kind of information with MP3::Info.
    foreach my $filename (@files) {
        # find out where these things are meant to be
        use MP3::Info qw(:all);

        # get the values
        use_mp3_utf8(1);
        my $tag = get_mp3tag($filename);

        # decode the utf8 bytes into chars
        use Encode qw(decode);
        foreach (values %{ $tag }) { $_ = decode("utf8", $_) }

        # extract what we're interested in
        my $title  = $tag->{TITLE}  || "Unknown";
        my $artist = $tag->{ARTIST} || "Unknown";
        my $album  = $tag->{ALBUM}  || "Unknown";

        # work out where they're going
        use File::Spec::Functions;
        use File::Path;
        use File::Copy;

        my $dir = catdir("mp3s", $artist, $album);
        mkpath([$dir]) unless -d $dir;

        # copy the mp3 there
        my $dest = catfile($dir, "$title.mp3");
        copy($filename, $dest);
    }
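Before letting a script like this loose on a real collection, it can help to see which paths it would produce. Here's a small dry-run sketch that uses only File::Spec::Functions. The destination_for helper and the sample tag hashes are made up for illustration, and nothing is actually copied.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(catdir catfile);

# Hypothetical helper: work out where a track would end up, given its tag.
sub destination_for {
    my ($tag) = @_;
    my $title  = $tag->{TITLE}  || "Unknown";
    my $artist = $tag->{ARTIST} || "Unknown";
    my $album  = $tag->{ALBUM}  || "Unknown";
    return catfile(catdir("mp3s", $artist, $album), "$title.mp3");
}

# Made-up tags standing in for what get_mp3tag would hand back.
my @tags = (
    { TITLE => "Brat", ARTIST => "Green_Day", ALBUM => "Insomniac" },
    { TITLE => "Minor_March" },    # no artist or album in the tag
);

# Print the destination for each tag without touching the filesystem.
print destination_for($_), "\n" foreach @tags;
```

Tracks with missing fields all land under the "Unknown" directories, so two untitled tracks from the same album would overwrite each other. That's worth bearing in mind before running the real copy.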
Of course, we're completely ignoring one huge problem - that the tag values might contain strange Unicode characters (or illegal ASCII ones for that matter), and we're using them as file and directory names. What we need to do is alter the script to munge the values as they come in:
    my $title  = munge($tag->{TITLE}  || "Unknown");
    my $artist = munge($tag->{ARTIST} || "Unknown");
    my $album  = munge($tag->{ALBUM}  || "Unknown");
And then write our munging routine:
    use Text::Unidecode;

    sub munge {
        local $_ = shift;

        # make ASCII
        $_ = unidecode($_);

        # convert some symbols into words
        s/\@/at/g;
        s/&/and/g;
        s/\*/star/g;

        # convert all nastiness to spaces (the hyphen goes last so it's literal)
        s/[^a-zA-Z0-9,.!\[\]'"|_()-]/ /g;

        # make all whitespace single spaces
        s/\s+/ /g;

        # remove starting or trailing whitespace
        s/^\s+//;
        s/\s+$//;

        # make all spaces underscores
        tr/ /_/;

        return $_;
    }
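To get a feel for what the clean-up steps do, here's a sketch that applies them to a few plain ASCII strings. It deliberately leaves out the unidecode call so that it runs with core Perl alone. The name tidy_ascii is invented for this example, and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the ASCII-only half of the munging routine.
sub tidy_ascii {
    my ($string) = @_;

    # convert some symbols into words
    $string =~ s/\@/at/g;
    $string =~ s/&/and/g;
    $string =~ s/\*/star/g;

    # anything outside the allowed set becomes a space
    # (the hyphen sits last in the class so it is taken literally)
    $string =~ s/[^a-zA-Z0-9,.!\[\]'"|_()-]/ /g;

    # collapse whitespace, trim both ends, then swap spaces for underscores
    $string =~ s/\s+/ /g;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    $string =~ tr/ /_/;

    return $string;
}

foreach my $sample ('Having A Blast', '  AC/DC: Back @ Black  ', 'Rock & Roll *') {
    print tidy_ascii($sample), "\n";
}
```

Note that the slash and colon in the second sample are not in the allowed set, so they each become a space, and the run of spaces that leaves behind is then squeezed down to a single underscore.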
Huzzah! We're done. This, of course, assumes that our MP3s have valid ID3 tags. That, however, is a discussion for another day.