The 2004 Perl Advent Calendar

On the 14th day of Advent my True Language brought to me..
SQL::Abstract

A lot of what we do as programmers is manipulating data. As such we often need to store that data in a database.

We've looked at one solution - Class::DBI - for easy database access in the past. Today we look at some of the problems associated with wrapping a database in an object model, and at one of the modules SQL::Abstract that can help us when we need to abandon that approach and create pure SQL directly.

Imagine you've got a large MySQL database with the following table structure:

  +=========+=======+======+=====+=========+=====+=========+=======+
  | user_id | first | last | age | address | zip | country | email |
  +---------+-------+------+-----+---------+-----+---------+-------+
  |         |       |      |     |         |     |         |       |
  +---------+-------+------+-----+---------+-----+---------+-------+

Writing a Class::DBI wrapper for it would be simple:

  package Customer::DBI;
  use base 'Class::DBI';
  __PACKAGE__->connection('dbi:mysql:customer', 'root', 'god');
  package Customer::User;
  use base 'Customer::DBI';
  __PACKAGE__->table('users');
  __PACKAGE__->columns( All => qw(
    user_id first last age address zip country email
  ));

And then we could access the database like so:

  # meet the flintstones
  my @users = Customer::User->search( last => "Flintstone" );

And we can add custom constructors that allow us to do more complex queries than simple matches:

  # in Customer::User
  __PACKAGE__->add_constructor(aged_over => 'age >= ?');
  # later, in the main code
  # select all the adults
  my @users = Customer::User->aged_over(18);
  
The Problem

The problem with this approach is that it's slow. It has to create a user object for every row it pulls from the database. Let's consider the simple case where we need to extract the users that match a query and write them out to a CSV file. The dumb Class::DBI approach is incredibly slow (though very easy to code):

  #!/usr/bin/perl
  use strict;
  use warnings;
  # load our support modules
  use Customer::User;
  use Text::CSV_XS;
  use IO::File;
  # create the csv file
  my $fh = IO::File->new("adults.csv", ">")
    or die "Can't open adults: $!";
  # create the writer object
  my $csv = Text::CSV_XS->new();
  # print out the column headers
  $csv->print($fh, [qw(
    user_id first last age address zip country email
  )]);
  # extract all the adults
  my @users = Customer::User->aged_over(18);
  # write out each object
  foreach my $user (@users)
  {
    $csv->print($fh, [
      $user->user_id,
      $user->first,
      $user->last,
      $user->age,
      $user->address,
      $user->zip,
      $user->country,
      $user->email,
    ]);
  }

This is terrible! We're slurping each and every user into memory at once (which is bad news if you've got a database with 200,000 users in it) and creating an object for them, and then accessing each of the data elements with an accessor method. This is total overkill for what we're trying to do.

To be fair, Class::DBI isn't really this stupid. Used properly, we can improve the performance somewhat by not dragging all the users into memory at once, replacing the fetch loop with an iterator:

  # extract all the adults, getting an iterator object
  # back by calling our constructor in scalar context
  my $users = Customer::User->aged_over(18);
  # write out each object, calling next on the iterator
  # which will finally return false when nothing is left
  while (my $user = $users->next)
  {
    ...
  }

This means that only one user is held in memory at a time. But we're still stuck with the overhead of pulling the data in, creating an object, and then using that object's accessors to get at the data, when all we need to do is spit that data straight back out again.

The Pure DBI solution

We can revert to the good old days of pure DBI calls if we want to get the best speed out of the database.

  #!/usr/bin/perl
  use strict;
  use warnings;
  # load our support modules
  use DBI;
  use Text::CSV_XS;
  use IO::File;
  # create the csv file
  my $fh = IO::File->new("adults.csv", ">")
    or die "Can't open adults: $!";
  # create the writer object
  my $csv = Text::CSV_XS->new();
  # print out the column headers
  $csv->print($fh, [qw(
    user_id first last age address zip country email
  )]);
  # connect to the database
  my $dbh = DBI->connect('dbi:mysql:customer', 'root', 'god')
    or die "Can't connect to the database";
  # prepare the query
  my $sth = $dbh->prepare(q{
   SELECT user_id, first, last, age, address, zip, country, email
      FROM users WHERE age >= ?
  }) or die "Problem preparing";
  # execute the query
  $sth->execute(18)
    or die "Problem executing";
  # write out each row we get back 
  while (my $data = $sth->fetchrow_arrayref())
    { $csv->print($fh, $data) }  
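
If you want to see the size of the difference for yourself rather than take my word for it, the core Benchmark module makes the comparison easy. The sketch below assumes the Customer::User class and the made-up connection details from earlier; the numbers you get will depend entirely on your database and your data.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Benchmark qw(cmpthese);
  use Customer::User;
  use DBI;
  # a plain handle for the raw DBI version (same made-up
  # connection details as the examples above)
  my $dbh = DBI->connect('dbi:mysql:customer', 'root', 'god')
    or die "Can't connect to the database";
  cmpthese(-10, {
    # Class::DBI: an object is built for every matching row
    class_dbi => sub {
      my $users = Customer::User->aged_over(18);
      while (my $user = $users->next) { my $email = $user->email }
    },
    # raw DBI: plain array refs straight from the driver
    raw_dbi => sub {
      my $sth = $dbh->prepare(
        "SELECT email FROM users WHERE age >= ?");
      $sth->execute(18);
      while (my $row = $sth->fetchrow_arrayref) { my $email = $row->[0] }
    },
  });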

Of course, the problem with this is that it only works for the one query we've written here: we had to write the SQL ourselves. What if we want to allow the user to specify something on the command line to configure a more complex search? This is trivial with Class::DBI, as we can replace our constructor with a search:

  # allow people to specify things like
  #   ./script.pl country 'GB'
  # to return all the people in the United Kingdom
  my $users = Customer::User->search(@ARGV);

Class::DBI builds the SQL needed to extract the data for us behind the scenes. But what are we to do in our fast version?

SQL::Abstract

It's obvious that what we need to do is programmatically generate the SQL ourselves. This isn't actually as easy as it might seem. Quoting the right things can be a pain, as can getting the spacing right. You've got to remember to add a WHERE keyword if you're adding arguments, and not if you're not. Perl's a lot more forgiving than MySQL and will happily let you create lists with extra commas on the end - MySQL won't. All in all it's a clear path to writing some very messy code that's hard to debug.
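
Here's a sketch of roughly what that messy hand-rolled code looks like for our CSV script - building up the WHERE clause from the command-line column/value pairs, and remembering the placeholders, the joining ANDs and the optional WHERE keyword ourselves:

  # the hard way: glue the SQL together by hand
  my %where = @ARGV;
  my (@clauses, @bind);
  for my $column (sort keys %where) {
    # a placeholder for every value, so nothing needs quoting
    push @clauses, "$column = ?";
    push @bind,    $where{$column};
  }
  my $sql = "SELECT user_id, first, last, age, address, zip, country, email"
          . " FROM users";
  # only add the WHERE keyword if there's something to filter on
  $sql .= " WHERE " . join(" AND ", @clauses) if @clauses;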

Luckily, we're not the first people to face this issue and someone else has gone to the trouble of writing the messy code for us, and debugged it for us, and put it together in a module called SQL::Abstract. Thank you Nathan Wiger!

Let's just have a quick look at how SQL::Abstract can be used.

  # create our SQL creation object
  my $sa = SQL::Abstract->new();
  my ($sql, @bind) = $sa->select(
   
    # the table
    "users",
    # what fields we want returned
    [qw( user_id first last age address zip country email )],
    # the where clause is a hashref of columns / values
    # e.g. { first => 'fred', last => 'barney' }
    { @ARGV },
  );

If we print what we got back:

  use Data::Dumper::Simple;
  print Dumper($sql,@bind);

We get what we might expect:

  $sql = 'SELECT user_id, first, last, age, address, 
          zip, country, email FROM users WHERE ( country = ? )';
  @bind = (
            'GB'
          );

Excellent. Plumbing this into our script at the right point makes that section look like this:

  # connect to the database
  my $dbh = DBI->connect('dbi:mysql:customer', 'root', 'god')
    or die "Can't connect to the database";
  # create the SQL and bind
  my $sa = SQL::Abstract->new();
  my ($sql, @bind) = $sa->select(
    "users",
    [qw( user_id first last age address zip country email )],
    { @ARGV },
  );  
  # prepare and execute query
  my $sth = $dbh->prepare($sql) or die "Problem preparing";
  $sth->execute(@bind) or die "Problem executing";
  # write out each row we get back 
  while (my $data = $sth->fetchrow_arrayref())
    { $csv->print($fh, $data) }  

More Complex Queries

Alas, SQL isn't as simple as just looking for simple values. We often want to build more complicated queries, and luckily we can create some pretty complex SQL with SQL::Abstract too. If we don't want to use = in our where clause, we can use the hashref notation to provide our own operator:

  # maybe it's because I'm a Londoner...
  my ($sql, @bind) = $sa->select(
    "users", "*", {
       address => { "LIKE" => '%London%' },
       country => "GB",
  });

Which produces the SQL:

  $sql = 'SELECT * FROM users WHERE ( address LIKE ? AND country = ? )';
  @bind = (
            '%London%',
            'GB'
          );

We can also do alternation with array refs:

  my ($sql, @bind) = $sa->select(
    "users", "*", {
     first => [qw( rod jane freddy )]
  });

Which creates a SQL statement with a bunch of OR keywords in it to offer us the various possibilities:

  $sql = 'SELECT * FROM users WHERE 
          ( ( ( first = ? ) OR ( first = ? ) OR ( first = ? ) ) )';
  @bind = (
            'rod',
            'jane',
            'freddy'
          );
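
select will also take an optional fourth argument describing how the results should be ordered (see the SQL::Abstract documentation for the exact forms your version accepts). For instance, sorting our results by surname and then first name might look like this:

  # select the Britons, sorted by surname then first name
  my ($sql, @bind) = $sa->select(
    "users",
    [qw( user_id first last )],
    { country => "GB" },
    [qw( last first )],     # becomes ORDER BY last, first
  );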

Of course, the other methods such as insert, delete and update also work:

  # add new values
  my ($sql, @bind) = $sa->insert(
   "users", {
     first => "mark",
     last  => "fowler",
     email => 'mark@twoshortplanks.com',
     age   => "26",
  });
  # delete the kids
  my ($sql, @bind) = $sa->delete("users", {
     age => { "<", 18 }
  });
  # happy birthday to me!
  my ($sql, @bind) = $sa->update(
   "users", {
     age   => "27",
  },
  {
     email => 'mark@twoshortplanks.com',
  });
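
Just as with select, these methods only hand you back the SQL and the bind values - running them is still up to you. For statements that don't return rows, DBI's do method is the natural fit; here's a quick sketch, assuming the $dbh handle we connected earlier:

  # build the statement...
  my ($sql, @bind) = $sa->delete("users", { age => { "<", 18 } });
  # ...then let do() prepare it, execute it and tell us how
  # many rows were affected
  my $rows = $dbh->do($sql, undef, @bind)
    or die "Problem deleting: " . $dbh->errstr;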

Getting back to Class::DBI

As we can see, SQL::Abstract allows us to create some really quite complex SQL. Sometimes we'd like that power directly from within Class::DBI. Handily, SQL::Abstract has a where method that takes the same where-clause data structure as select but returns just the WHERE clause of the SQL statement, which can then be fed to add_constructor as we demonstrated above.

  # in Customer::User
  # create our factory
  my $sa = SQL::Abstract->new();
  # make some sql (the '?' bind value is discarded - we just
  # want the placeholder left in the clause)
  my ($where) = $sa->where(
    { age => { ">=", "?" } }
  );
  # add_constructor supplies its own WHERE keyword, so strip the
  # one that where() put on the front
  $where =~ s/^\s*WHERE\s*//i;
  # create the constructor
  __PACKAGE__->add_constructor( aged_over => $where );

MySQL Optimisations To Be Aware Of

One of the things I've been glossing over in these examples is the way MySQL handles results. In the pure DBI section above, and in the Class::DBI iterator example, I assumed that because I can get the database to give me one record back at a time, DBI was only fetching one record at a time from the database server. This isn't true.

Current implementations of MySQL by default drag every single result back from the server and store them all on the client before any results are returned to Perl (this is a MySQL limitation, not a Perl one). This means that if you're extracting 200,000 users from the database then you're going to need enough memory on your machine to hold all 200,000 records at once.

There is a way around this. You can tell DBD::mysql to give you back the data one record at a time:

  # do this before the execute
  $sth->{'mysql_use_result'} = 1;
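
Depending on the version of DBD::mysql you have installed, you may also be able to ask for this behaviour when preparing the statement instead of poking the handle afterwards:

  # ask for an unbuffered ("use result") statement handle up front
  my $sth = $dbh->prepare($sql, { mysql_use_result => 1 })
    or die "Problem preparing";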

Of course, there is a price to pay for this. Until your run completes, the table you're querying will be locked for updates - not a good thing if it's the live database your main web site runs off and your export is going to take a couple of hours.

There's another way around this that avoids both the locking problem and the everything-in-memory-at-once problem, though it's rather more work: switch to a database that supports cursors, such as PostgreSQL.
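
For the curious, here's roughly what the cursor approach looks like with DBD::Pg. The cursor name, batch size and connection details below are just illustrative ($csv and $fh are the writer and file handle from the script above); the important points are that cursors live inside a transaction and that you fetch the rows back in batches:

  # cursors only exist inside a transaction, so turn AutoCommit off
  my $dbh = DBI->connect('dbi:Pg:dbname=customer', 'postgres', 'god',
    { AutoCommit => 0, RaiseError => 1 });
  # declare a server-side cursor for the query
  $dbh->do(q{
    DECLARE adult_csr CURSOR FOR
    SELECT user_id, first, last, age, address, zip, country, email
      FROM users WHERE age >= 18
  });
  # pull the results back 1000 rows at a time
  while (1) {
    my $sth = $dbh->prepare("FETCH 1000 FROM adult_csr");
    $sth->execute;
    last unless $sth->rows;
    while (my $data = $sth->fetchrow_arrayref) {
      $csv->print($fh, $data);
    }
  }
  # tidy up
  $dbh->do("CLOSE adult_csr");
  $dbh->commit;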

  • Class::DBI
  • DBI