The 2002 Perl Advent Calendar

On the 7th day of Advent my True Language brought to me..
File::MMagic

What we normally mean by the 'type' of a file is actually its MIME type. Every file sent across the web is sent with its own MIME type, and attachments in mail carry MIME type declarations too. For example, a JPEG image is 'image/jpeg' and a web page is 'text/html'.

When we know a file's MIME type we know what kind of data it is and what we can do with it; for instance, we know which program to load to view it. At the very least, you can use it to check that the data some user just uploaded to their user page on your webserver really is a picture file, and not some other kind of binary data that a broken or malicious client has mislabelled. This follows a good practice guideline: never trust any data the user sends you without checking it first.
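
As a rough sketch of that kind of check: the allowlist and the is_allowed_image helper below are hypothetical names made up for illustration, and the type string is assumed to come from one of File::MMagic's checktype_* methods, which are covered further on.

```perl
use strict;
use warnings;

# Hypothetical allowlist of the image types we are willing to accept
my %allowed_image = map { $_ => 1 } qw(image/jpeg image/png image/gif);

# Given a MIME type string, say whether it names an accepted image type
sub is_allowed_image {
    my ($mime) = @_;
    return 0 unless defined $mime;
    $mime =~ s/\s*;.*\z//s;   # drop any parameters, e.g. "; charset=..."
    return $allowed_image{ lc $mime } ? 1 : 0;
}

print is_allowed_image('image/jpeg')               ? "accept\n" : "reject\n";
print is_allowed_image('application/octet-stream') ? "accept\n" : "reject\n";
```

An allowlist is used rather than a blocklist so that any type we haven't thought about is rejected by default.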

The File::MMagic module can be used to determine the MIME type of a file. It uses all kinds of cunning to do this. Firstly it uses a database of "magic" numbers to look at the first few bytes for telltale signs: for example, GIF files start with "GIF" and Flash files start with "FWS". If that fails (HTML files, for example, don't start with anything special) then the module can use regular expressions to check both the filename and the contents of the file for giveaway signs that distinguish them.
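
To make the "magic" number idea concrete, here is a minimal sketch of the technique in plain Perl. It is not File::MMagic's actual implementation: the four-entry signature table and the sniff_magic function are invented for illustration, and the real module's database is far larger.

```perl
use strict;
use warnings;

# Hypothetical signature table: leading bytes => MIME type
my @signatures = (
    [ "GIF"               => 'image/gif' ],
    [ "FWS"               => 'application/x-shockwave-flash' ],
    [ "\xFF\xD8\xFF"      => 'image/jpeg' ],
    [ "\x89PNG\r\n\x1A\n" => 'image/png' ],
);

# Return the MIME type whose signature matches the start of $data,
# or nothing at all if none of the signatures match
sub sniff_magic {
    my ($data) = @_;
    for my $sig (@signatures) {
        my ($magic, $type) = @$sig;
        return $type if substr($data, 0, length $magic) eq $magic;
    }
    return;
}

print sniff_magic("GIF89a...") // 'unknown', "\n";   # prints "image/gif"
print sniff_magic("<html>")    // 'unknown', "\n";   # prints "unknown"
```

The second call shows why a fallback is needed: HTML has no fixed leading bytes, so a pure signature lookup cannot identify it.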

The File::MMagic module is pretty easy to use. Essentially it's just a case of creating a parser object and then telling it to look at a file:

  use File::MMagic;
  my $mm = File::MMagic->new();
  print "The MIME type of '$ARGV[0]' is: " .
          $mm->checktype_filename($ARGV[0]) . "\n";

Of course, it can check an open filehandle too:

  use File::MMagic;
  use IO::File;
  my $mm = File::MMagic->new();
  # open the file in binary mode
  my $filehandle = IO::File->new("image.jpg")
    or die "couldn't open 'image.jpg': $!";
  binmode $filehandle;
  print "The MIME type of 'image.jpg' is: " .
          $mm->checktype_filehandle($filehandle) . "\n";

Or even from a chunk of data already loaded into memory:

  use File::MMagic;
  use IO::File;
  my $mm = File::MMagic->new();
  # open a file in binary mode
  my $filehandle = IO::File->new("image.jpg")
    or die "couldn't open 'image.jpg': $!";
  binmode $filehandle;
  # read in the entire file into $data
  my $data;
  {
    local $/;   # set it so <> reads all the file at once
    $data = <$filehandle>;  # read in the file
  }
  print "The MIME type of 'image.jpg' is: " .
          $mm->checktype_contents($data) . "\n";

So with this newfound knowledge we can construct an example script that looks at all the files in a directory and builds a web page with a graph. First we check each file for its MIME type and size, and store the cumulative size for each type in a hash.

  #!/usr/bin/perl
  # turn on perl's safety features
  use strict;
  use warnings;
  # load the modules
  use File::MMagic;
  # new parser
  my $mm = File::MMagic->new();
  # open the dir
  opendir DIR, $ARGV[0]
     or die "Couldn't open the directory '$ARGV[0]': $!";
  # work through the files in the dir
  my %files;
  while (defined(my $file = readdir DIR))
  {
    # readdir returns bare names, so add the directory back on
    my $path = "$ARGV[0]/$file";
    # skip it if it isn't just a normal file
    next unless -f $path;
    # get the mime type and other info
    my $magic = $mm->checktype_filename($path);
    # delete anything after the mime type
    $magic =~ s/ ;  # look for the first semicolon
                 .* # and then anything up until
                 $  # the end of line
                 //x;
    # add on that size to a hash
    $files{ $magic } += -s $path;
  }
  closedir DIR;

Now, using that information, we can create a chart using the GD::Graph::hbars module.

  use GD::Graph::hbars;
  use IO::File;
  # create a new horizontal bar chart
  my $graph = GD::Graph::hbars->new(400,300);
  # plot the data onto it, and get a GD::Image back
  my $gd = $graph->plot([[keys %files],[values %files]])
    or die "Can't plot the chart: " . $graph->error;
  # open a file to write it to, and save it as a png
  my $img_fh = IO::File->new("chart.png",">")
    or die "Can't open 'chart.png': $!";
  binmode $img_fh;
  print {$img_fh} $gd->png;
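
The bars come out in whatever order the hash happens to return its keys. If a ranked chart is preferred, the data set can be sorted before plotting. The following is only a sketch of one way to do that; the sample byte counts stand in for the %files hash built earlier, and @types and @data are names invented here.

```perl
use strict;
use warnings;

# Sample totals in bytes, standing in for the %files hash built earlier
my %files = (
    'text/plain' => 2048,
    'image/jpeg' => 10240,
    'text/html'  => 4096,
);

# order the MIME types from largest to smallest total, ties broken by name
my @types = sort { $files{$b} <=> $files{$a} or $a cmp $b } keys %files;

# build the two parallel lists that plot() takes: the labels, then the values
my @data = ( [ @types ], [ map { $files{$_} } @types ] );

print join(", ", @{ $data[0] }), "\n";   # prints "image/jpeg, text/html, text/plain"
print join(", ", @{ $data[1] }), "\n";   # prints "10240, 4096, 2048"
```

A reference to @data could then be passed to plot() in place of the inline list of keys and values.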

And finally we print out the HTML. Note that we use the HTML::Entities module to encode the data that we're printing out. This means that any HTML characters like '<' or '>' will be escaped. We are not likely to have these characters in the directory name or the MIME types, but you never know.

  use HTML::Entities;
  # open the file
  my $html_fh = IO::File->new("chart.html",">")
    or die "Can't open 'chart.html': $!";
  # and write out the html
  my $dir = encode_entities($ARGV[0]);
  print {$html_fh} qq{
  <html>
   <head><title>Files by mime type for: $dir</title></head>
  <body>
  <img src="chart.png" width="400" height="300">
  <table>};
  # print a line for each MIME type
  foreach my $key (keys %files)
  {
    print {$html_fh}
       "<tr><td>" . encode_entities($key) . "</td>" .
       "<td>" . int($files{ $key }/1024). "k</td></tr>";
  }
  print {$html_fh} q{
  </table>
  </body>
  </html>
  };

  • Example output from my home dir
  • GD::Graph
  • HTML::Entities