2019 twenty four merry days of Perl Feed

Christmas Movie Time!

PerlIO::zip - 2019-12-08

Have you ever wondered what the best Christmas Movie of all time is?

Spoiler Alert: It's "It's a Wonderful Life".

Of course, that's not just my opinion: I can prove it...with Perl.

Christmas Movies

The first thing we need to prove this comprehensively is a collection of Christmas movies.

Did you know that Wikipedia has a list of Christmas movies on it? Neither did I, but I shouldn't be surprised; It has pretty much everything and the kitchen sink on it. Let's scrape that list and put it in a database.

First, we create need a table:


1: 
2: 
3: 
4: 

 

CREATE TABLE wikipedia_films (
    name TEXT,
    year INTEGER
);

 

We can use the on-disk daemonless SQLite SQL database to do this. Most systems ship with the sqlite3 command line tool that'll write the database from the SQL for us:

    $ sqlite3 /tmp/db < wikipedia_films.sql

Now we need a program that can scrape the web page and populate the database. Easy-peasy with Mojo::UserAgent, which we've covered extensively in past advent calendars:


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 
20: 
21: 
22: 
23: 
24: 
25: 
26: 
27: 
28: 
29: 
30: 
31: 

 

#!/usr/bin/perl

use strict;
use 5.024;

use DBI;
use Mojo::UserAgent;

my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");

my $ua = Mojo::UserAgent->new();
my $res = $ua->get(
    'https://en.wikipedia.org/wiki/List_of_Christmas_films'
)->result;
if ($res->is_error) { die $res->message }

$dbh->begin_work;
$res->dom->find('.wikitable tbody tr')->each(sub {
    my $a = $_->at("td:first-child * a");
    next unless $a;
    my $name = $a->text;

    my $year = $_->at("td:nth-child(2)")->text;

    say STDERR "Inserting $name ($year)";
    $dbh->do( <<'SQL', {}, $name, $year );
INSERT INTO wikipedia_films
            (name, year) VALUES (?,?)
SQL
});
$dbh->commit;

 

Now we've got a big old list in our database...but what's the best of these movies?

Downloading The IMDb

If we're being scientific about this we shouldn't just use our opinion. We should use the wisdom of crowds: The Internet Movie Database rating for the film.

So that's straight forward: First we just download the entire of the IMDb...wait, you didn't know we could do that? Sure! The IMDb publishes a bunch of tab separated compressed files of their core data every day.

If we use lwp-mirror to download the files we can mirror the large files to disk.

   $ lwp-mirror https://datasets.imdbws.com/title.basics.tsv.gz title.basics.tsv.gz
   $ lwp-mirror https://datasets.imdbws.com/title.ratings.tsv.gz title.ratings.tsv.gz

Because we're using lwp-mirror we can safely re-run the downloads as often as we want - a new version will only be downloaded when the contents changes.

Importing into Our Database

Okay, we next need to put all of that data into our database. Let's create a new table:


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 

 

CREATE TABLE imdb_films (
    title TEXT,
    name TEXT,
    year INTEGER,
    rating FLOAT
);

CREATE INDEX imdb_films_title_idx
ON imdb_films(title);

 

In the database:

    $ sqlite3 /tmp/db < imdb_films.sql

And populate it with films:


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 
20: 
21: 
22: 
23: 
24: 
25: 
26: 
27: 
28: 
29: 
30: 
31: 
32: 
33: 
34: 
35: 
36: 
37: 
38: 
39: 
40: 
41: 
42: 
43: 
44: 
45: 
46: 
47: 
48: 
49: 
50: 
51: 
52: 
53: 
54: 
55: 
56: 
57: 
58: 
59: 
60: 
61: 
62: 
63: 
64: 

 

#!/usr/bin/perl

use strict;
use 5.024;

use DBI;
use List::AllUtils qw( zip );
use Mojo::UserAgent;
use PerlIO::gzip;
use Term::ProgressBar;

my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");

# how big is this file?
my $total = `gunzip -c title.basics.tsv.gz | wc -l`;

# the file is gzipped, so use the gzip layer
# to transparently decompress it as we read it
open my $fh, '<:gzip', 'title.basics.tsv.gz'
or die "Can't open file: $!";

# read the first line in that contains the headings
$_ = <$fh>;
chomp;
my @headings = split /\t/;

# prepare the insert statement
my $sth = $dbh->prepare(<<'SQL');
INSERT INTO imdb_films
    ( title, name, year )
    VALUES
    ( ?, ?, ?)
SQL

# process each row of the file
my $progress = Term::ProgressBar->new ({
    count => $total,
    ETA => 'linear',
});
$dbh->begin_work;
while (<$fh>) {
    chomp;
    my @row_data = split /\t/;

# zip returns an element from each of the arrays
    # in turn i.e. key, value, key, value, key, value...
my %data = zip @headings, @row_data;

# ignore anything that isn't a movie or, ahem, isn't
    # in the true christmas spirit
next if $data{titleType} ne 'movie';
    next if $data{isAdult};

# insert just some of the fields from the data
    # note ratings aren't in here yet - next script!
$sth->execute(@data{qw(
        tconst
        originalTitle
        startYear
    )
});

    $progress->update($.);
}
$dbh->commit;

 

And do almost same thing with the ratings for the films:


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 
20: 
21: 
22: 
23: 
24: 
25: 
26: 
27: 
28: 
29: 
30: 
31: 
32: 
33: 
34: 
35: 
36: 
37: 
38: 
39: 
40: 
41: 
42: 
43: 
44: 
45: 
46: 
47: 
48: 
49: 
50: 
51: 
52: 
53: 
54: 
55: 

 

#!/usr/bin/perl

use strict;
use 5.024;

use DBI;
use List::AllUtils qw( zip );
use Mojo::UserAgent;
use PerlIO::gzip;
use Term::ProgressBar;

my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");

# how big is this file?
my $total = `gunzip -c title.ratings.tsv.gz | wc -l`;

# the file is gzipped, so use the gzip layer
# to transparently decompress it as we read it
open my $fh, '<:gzip', 'title.ratings.tsv.gz'
or die "Can't open file: $!";

# read the first line in that contains the headings
$_ = <$fh>;
chomp;
my @headings = split /\t/;

# prepare the insert statement
my $sth = $dbh->prepare(<<'SQL');
UPDATE imdb_films
    SET rating = ?
    WHERE title = ?
SQL

# process each row of the file
my $progress = Term::ProgressBar->new ({
    count => $total,
    ETA => 'linear',
});
$dbh->begin_work;
while (<$fh>) {
    chomp;
    my @row_data = split /\t/;

# zip returns an element from each of the arrays
    # in turn i.e. key, value, key, value, key, value...
my %data = zip @headings, @row_data;

    $sth->execute(@data{qw(
        averageRating
        tconst
    )
});

    $progress->update($.);
}
$dbh->commit;

 

The Moment of Truth

Finally, we can categorically prove what's the best Christmas Film

    $ sqlite3 /tmp/db
    sqlite> SELECT name
       ...> FROM wikipedia_films
       ...> JOIN imdb_films
       ...> USING (name, year)
       ...> ORDER BY rating DESC
       ...> LIMIT 1;
    It's a Wonderful Life
Gravatar Image This article contributed by: Mark Fowler <mark@twoshortplanks.com>