A lot of what we do as programmers is manipulating data. As such we often need to store that data in a database.
We've looked at one easy-database-access solution - Class::DBI - in the past. Today we look at some of the problems associated with wrapping a database in an object model, and at SQL::Abstract, a module that can help us when we need to abandon that approach and create pure SQL directly.
Imagine you've got a large MySQL database with the following table structure:
    +---------+-------+------+-----+---------+-----+---------+-------+
    | user_id | first | last | age | address | zip | country | email |
    +---------+-------+------+-----+---------+-----+---------+-------+
Writing a Class::DBI wrapper for it would be simple:
    package Customer::DBI;
    use base 'Class::DBI';
    __PACKAGE__->connection('dbi:mysql:customer', 'root', 'god');
    package Customer::User;
    use base 'Customer::DBI';
    __PACKAGE__->table('users');
    __PACKAGE__->columns( All => qw(
        user_id first last age address zip country email
    ));
And then we could access the database like so:
    # meet the flintstones
    my @users = Customer::User->search( last => "Flintstone" );
And we can add custom constructors that allow us to do more complex queries than simple matches:
    # in Customer::User
    __PACKAGE__->add_constructor(aged_over => 'age >= ?');
    # later, in the main code
    # select all the adults
    my @users = Customer::User->aged_over(18);

=head2 The Problem
The problem with this approach is that it is slow. It has to create a user object for every row in the database. Let's consider the simple case where we need to extract the matching users from a query and write them out to a CSV file. The dumb Class::DBI approach is incredibly slow (though very easy to code):
#!/usr/bin/perl
    use strict;
    use warnings;
    # load our support modules
    use Customer::User;
    use Text::CSV_XS;
    use IO::File;
    # create the csv file
    my $fh = IO::File->new("adults.csv", ">")
        or die "Can't open adults: $!";
    # create the writer object
    my $csv = Text::CSV_XS->new();
    # print out the column headers
    $csv->print($fh, [qw(
        user_id first last age address zip country email
    )]);
    # extract all the adults
    my @users = Customer::User->aged_over(18);
    # write out each object
    foreach my $user (@users) {
        $csv->print($fh, [ $user->user_id,
                           $user->first,
                           $user->last,
                           $user->age,
                           $user->address,
                           $user->zip,
                           $user->country,
                           $user->email, ]);
    }
This is terrible! We're slurping each and every user into memory at once (which is bad news if you've got a database with 200,000 users in it), creating an object for every one of them, and then accessing each of the data elements with an accessor method. This is total overkill for what we're trying to do.
To be fair, Class::DBI isn't really this stupid. Used properly, we can improve the performance somewhat by not dragging every user into memory at once, replacing the fetch loop with an iterator:
    # extract all the adults, getting an iterator object
    # back by calling our constructor in scalar context
    my $users = Customer::User->aged_over(18);
    # write out each object, calling next on the iterator,
    # which will finally return false when nothing is left
    while (my $user = $users->next) {
        ...
    }
This means only one user is held in memory at a time. But we're still stuck with the overhead of pulling the data in, creating an object, and then using that object's accessors to get at the data, when all we need to do is spit that data back out again.
We can revert to the good old days of pure DBI calls if we want to get the best speed out of the database.
#!/usr/bin/perl
    use strict;
    use warnings;
    # load our support modules
    use DBI;
    use Text::CSV_XS;
    use IO::File;
    # create the csv file
    my $fh = IO::File->new("adults.csv", ">")
        or die "Can't open adults: $!";
    # create the writer object
    my $csv = Text::CSV_XS->new();
    # print out the column headers
    $csv->print($fh, [qw(
        user_id first last age address zip country email
    )]);
    # connect to the database
    my $dbh = DBI->connect('dbi:mysql:customer', 'root', 'god')
        or die "Can't connect to the database";
    # prepare the query
    my $sth = $dbh->prepare(q{
        SELECT user_id, first, last, age, address, zip, country, email
        FROM users
        WHERE age >= ?
    }) or die "Problem preparing";
    # execute the query
    $sth->execute(18)
        or die "Problem executing";
    # write out each row we get back
    while (my $data = $sth->fetchrow_arrayref()) {
        $csv->print($fh, $data);
    }
Of course, the problem with this is that it only works for the one query we've got here. We had to write the SQL ourselves. What if we want to allow the user to specify something on the command line to configure a more complex search? This is trivial with Class::DBI, as we can replace our constructor with a search:
    # allow people to specify things like
    #   ./script.pl country 'GB'
    # to return all the people in the United Kingdom
    my $users = Customer::User->search(@ARGV);
Class::DBI builds the SQL needed to extract the data for us behind the scenes. But what are we to do in our fast version?
It's obvious that what we need to do is programmatically generate the SQL ourselves. This isn't as easy as it might seem. Quoting the right things can be a pain, as can getting the correct spacing in. You've got to remember to add a WHERE keyword if you're adding conditions, and leave it out if you're not. Perl's a lot more forgiving than MySQL and will happily let you create lists with extra commas on the end - MySQL won't. All in all it's a clear path to writing some very messy code that's hard to debug.
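To see why, here's a minimal sketch in plain Perl of what hand-rolling just the WHERE clause involves. The `build_where` helper is made up for illustration, and it only handles simple equality tests ANDed together - note the special-casing for the no-conditions case that's so easy to forget:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_where is a hypothetical hand-rolled helper, shown only to
# illustrate the fiddly details that SQL::Abstract takes care of
sub build_where {
    my %where = @_;

    my (@clauses, @bind);
    # sort the columns so the generated SQL is predictable
    foreach my $col (sort keys %where) {
        push @clauses, "$col = ?";   # placeholders sidestep quoting problems
        push @bind,    $where{$col};
    }

    # remember: no WHERE keyword at all if there are no conditions
    my $sql = @clauses ? " WHERE " . join(" AND ", @clauses) : "";
    return ($sql, @bind);
}

my ($where, @bind) = build_where( country => "GB", age => 18 );
print "SELECT * FROM users$where\n";
# prints: SELECT * FROM users WHERE age = ? AND country = ?
```

And that's before we think about ORs, alternate operators like LIKE, or nested conditions - each of which adds yet more special cases.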
Luckily, we're not the first people to face this issue: someone else has gone to the trouble of writing the messy code, debugging it, and putting it together in a module called SQL::Abstract. Thank you Nathan Wiger!
Let's just have a quick look at how SQL::Abstract can be used.
    # create our SQL creation object
    use SQL::Abstract;
    my $sa = SQL::Abstract->new();
    my ($sql, @bind) = $sa->select(
        # the table
        "users",

        # what fields we want returned
        [qw( user_id first last age address zip country email )],

        # the where clause is a hashref of columns / values
        # e.g. { first => 'fred', last => 'barney' }
        { @ARGV },
    );
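The `{ @ARGV }` part works because Perl will happily build a hash reference from any flat list of key/value pairs. A quick illustration in plain Perl, with a made-up argument list standing in for what the user typed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# pretend the user ran:  ./script.pl country GB age 18
@ARGV = qw( country GB age 18 );

# the flat list pairs up into key => value entries
my $where = { @ARGV };

print "$_ => $where->{$_}\n" for sort keys %$where;
# prints:
# age => 18
# country => GB
```

One caveat: if the user supplies an odd number of arguments the pairing goes wrong (Perl will warn about the odd number of elements under `use warnings`), so a real script should check `@ARGV % 2 == 0` first.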
If we print what we got back:
    use Data::Dumper::Simple;
    print Dumper($sql, @bind);
We get what we might expect:
    $sql = 'SELECT user_id, first, last, age, address, zip, country, email
            FROM users WHERE ( country = ? )';
    @bind = ( 'GB' );
Excellent. Plumbing this into our script at the right point makes the section look like this:
    # connect to the database
    my $dbh = DBI->connect('dbi:mysql:customer', 'root', 'god')
        or die "Can't connect to the database";
    # create the SQL and bind
    my $sa = SQL::Abstract->new();
    my ($sql, @bind) = $sa->select(
        "users",
        [qw( user_id first last age address zip country email )],
        { @ARGV },
    );
    # prepare and execute query
    my $sth = $dbh->prepare($sql)
        or die "Problem preparing";
    $sth->execute(@bind)
        or die "Problem executing";
    # write out each row we get back
    while (my $data = $sth->fetchrow_arrayref()) {
        $csv->print($fh, $data);
    }
Alas, real queries aren't always simple equality matches; we often want to create more complicated SQL. Luckily, we can create some pretty complex SQL with SQL::Abstract too. If we don't want to use = in our where clause we can use the hash notation to provide our own operator:
    # maybe it's because I'm a Londoner...
    my ($sql, @bind) = $sa->select("users", "*", {
        address => { "LIKE" => '%London%' },
        country => "GB",
    });
Which produces the SQL:
    $sql = 'SELECT * FROM users WHERE ( address LIKE ? AND country = ? )';
    @bind = ( '%London%', 'GB' );
We can also do alternation with array refs:
    my ($sql, @bind) = $sa->select("users", "*", {
        first => [qw( rod jane freddy )],
    });
Which creates a SQL statement with a bunch of OR keywords in it to offer us the various possibilities:
    $sql = 'SELECT * FROM users
            WHERE ( ( ( first = ? ) OR ( first = ? ) OR ( first = ? ) ) )';
    @bind = ( 'rod', 'jane', 'freddy' );
Of course, the other methods such as insert, delete and update also work:
    # add new values
    my ($sql, @bind) = $sa->insert("users", {
        first => "mark",
        last  => "fowler",
        email => 'mark@twoshortplanks.com',
        age   => "26",
    });
    # delete the kids
    my ($sql, @bind) = $sa->delete("users", {
        age => { "<", 18 },
    });
    # happy birthday to me!
    my ($sql, @bind) = $sa->update(
        "users",
        { age   => "27" },
        { email => 'mark@twoshortplanks.com' },
    );
As we can see, SQL::Abstract allows us to create some really quite complex SQL. Sometimes we'd like the power of doing this directly from within Class::DBI. Handily, SQL::Abstract has a simple where method that takes the same arguments as the select method but returns just the WHERE clause of the SQL statement, which can then be fed into add_constructor as we demonstrated above.
    # in Customer::User

    # create our factory
    my $sa = SQL::Abstract->new();

    # make some sql; the literal "?" value here ends up as a
    # placeholder in the generated clause
    my ($where) = $sa->where({ age => { ">=", "?" } });

    # add_constructor wants just the conditions, so strip off the
    # leading WHERE keyword that where() adds for us
    $where =~ s/^\s*WHERE\s*//;

    # create the constructor
    __PACKAGE__->add_constructor( aged_over => $where );
One of the things I've been glossing over in this example is the behaviour of MySQL itself. In the pure DBI section above, and in the Class::DBI iterator example, I assumed that because I can get the database to give me one record back at a time, DBI was only fetching one record at a time from the database server. This isn't true.
Current implementations of MySQL by default drag every single result back from the database and store the results in the client before any results are returned to Perl (this is a MySQL limitation not a Perl one). This means that if you're extracting 200,000 users from the database then you're going to have to have space to store those 200,000 records in memory on your machine at once.
There is a way around this. You can tell MySQL to give you back the data one record at a time:
    # do this before the execute
    $sth->{'mysql_use_result'} = 1;
Of course, there is a price to pay for this. With mysql_use_result set, the table you're querying stays locked against updates until you complete your run - not a good thing if you're querying the live database behind your main web site and the job is going to take a couple of hours.
There's another way around this that avoids both the locking problem and the pulling-everything-into-memory problem, but it takes more effort to implement: switch to a database that supports cursors, such as PostgreSQL. Of course, that's a lot more work.
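For the curious, here's a rough sketch of what the cursor approach can look like with DBD::Pg. This assumes a hypothetical PostgreSQL copy of our customer database with the same users table - the connection details, cursor name, and batch size are all made up, and you'd need a live database to actually run it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connect to a hypothetical PostgreSQL copy of the customer database;
# cursors must live inside a transaction, hence AutoCommit => 0
my $dbh = DBI->connect('dbi:Pg:dbname=customer', 'postgres', 'god',
                       { AutoCommit => 0, RaiseError => 1 });

# declare a cursor for the query; the server holds the result set
# and only hands rows over as we ask for them
$dbh->do(q{
    DECLARE adults CURSOR FOR
    SELECT user_id, first, last, age, address, zip, country, email
    FROM users
    WHERE age >= 18
});

# fetch the rows a batch at a time, so we never hold more than
# 1000 records in client memory at once
while (1) {
    my $sth = $dbh->prepare("FETCH 1000 FROM adults");
    $sth->execute();
    last unless $sth->rows;
    while (my $data = $sth->fetchrow_arrayref()) {
        # ... write the row out to the CSV, as before ...
    }
}

# tidy up
$dbh->do("CLOSE adults");
$dbh->commit;
```

The important difference from the MySQL workaround is that other clients can keep updating the table while we work through the cursor.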