Have you ever wondered what the best Christmas Movie of all time is?
Spoiler Alert: It's "It's a Wonderful Life".
Of course, that's not just my opinion: I can prove it...with Perl.
Christmas Movies
The first thing we need to prove this comprehensively is a collection of Christmas movies.
Did you know that Wikipedia has a list of Christmas movies on it? Neither did I, but I shouldn't be surprised; It has pretty much everything and the kitchen sink on it. Let's scrape that list and put it in a database.
First, we create need a table:
1: 2: 3: 4: |
CREATE TABLE wikipedia_films ( name TEXT, year INTEGER ); |
We can use the on-disk daemonless SQLite SQL database to do this. Most systems ship with the sqlite3
command line tool that'll write the database from the SQL for us:
$ sqlite3 /tmp/db < wikipedia_films.sql
Now we need a program that can scrape the web page and populate the database. Easy-peasy with Mojo::UserAgent, which we've covered extensively in past advent calendars:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: |
use strict; use 5.024;
use DBI; use Mojo::UserAgent;
my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");
my $ua = Mojo::UserAgent->new(); my $res = $ua->get( 'https://en.wikipedia.org/wiki/List_of_Christmas_films' )->result; if ($res->is_error) { die $res->message }
$dbh->begin_work; $res->dom->find('.wikitable tbody tr')->each(sub { my $a = $_->at("td:first-child * a"); next unless $a; my $name = $a->text;
my $year = $_->at("td:nth-child(2)")->text;
say STDERR "Inserting $name ($year)"; $dbh->do( <<'SQL', {}, $name, $year ); INSERT INTO wikipedia_films (name, year) VALUES (?,?) SQL }); $dbh->commit;
|
Now we've got a big old list in our database...but what's the best of these movies?
Downloading The IMDb
If we're being scientific about this we shouldn't just use our opinion. We should use the wisdom of crowds: The Internet Movie Database rating for the film.
So that's straight forward: First we just download the entire of the IMDb...wait, you didn't know we could do that? Sure! The IMDb publishes a bunch of tab separated compressed files of their core data every day.
If we use lwp-mirror to download the files we can mirror the large files to disk.
$ lwp-mirror https://datasets.imdbws.com/title.basics.tsv.gz title.basics.tsv.gz
$ lwp-mirror https://datasets.imdbws.com/title.ratings.tsv.gz title.ratings.tsv.gz
Because we're using lwp-mirror we can safely re-run the downloads as often as we want - a new version will only be downloaded when the contents changes.
Importing into Our Database
Okay, we next need to put all of that data into our database. Let's create a new table:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
CREATE TABLE imdb_films ( title TEXT, name TEXT, year INTEGER, rating FLOAT );
CREATE INDEX imdb_films_title_idx ON imdb_films(title); |
In the database:
$ sqlite3 /tmp/db < imdb_films.sql
And populate it with films:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48: 49: 50: 51: 52: 53: 54: 55: 56: 57: 58: 59: 60: 61: 62: 63: 64: |
use strict; use 5.024;
use DBI; use List::AllUtils qw( zip ); use Mojo::UserAgent; use PerlIO::gzip; use Term::ProgressBar;
my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");
my $total = `gunzip -c title.basics.tsv.gz | wc -l`;
open my $fh, '<:gzip', 'title.basics.tsv.gz' or die "Can't open file: $!";
$_ = <$fh>; chomp; my @headings = split /\t/;
my $sth = $dbh->prepare(<<'SQL'); INSERT INTO imdb_films ( title, name, year ) VALUES ( ?, ?, ?) SQL
my $progress = Term::ProgressBar->new ({ count => $total, ETA => 'linear', }); $dbh->begin_work; while (<$fh>) { chomp; my @row_data = split /\t/;
my %data = zip @headings, @row_data;
next if $data{titleType} ne 'movie'; next if $data{isAdult};
$sth->execute(@data{qw( tconst originalTitle startYear )});
$progress->update($.); } $dbh->commit;
|
And do almost same thing with the ratings for the films:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48: 49: 50: 51: 52: 53: 54: 55: |
use strict; use 5.024;
use DBI; use List::AllUtils qw( zip ); use Mojo::UserAgent; use PerlIO::gzip; use Term::ProgressBar;
my $dbh = DBI->connect("dbi:SQLite:dbname=/tmp/db");
my $total = `gunzip -c title.ratings.tsv.gz | wc -l`;
open my $fh, '<:gzip', 'title.ratings.tsv.gz' or die "Can't open file: $!";
$_ = <$fh>; chomp; my @headings = split /\t/;
my $sth = $dbh->prepare(<<'SQL'); UPDATE imdb_films SET rating = ? WHERE title = ? SQL
my $progress = Term::ProgressBar->new ({ count => $total, ETA => 'linear', }); $dbh->begin_work; while (<$fh>) { chomp; my @row_data = split /\t/;
my %data = zip @headings, @row_data;
$sth->execute(@data{qw( averageRating tconst )});
$progress->update($.); } $dbh->commit;
|
The Moment of Truth
Finally, we can categorically prove what's the best Christmas Film
$ sqlite3 /tmp/db
sqlite> SELECT name
...> FROM wikipedia_films
...> JOIN imdb_films
...> USING (name, year)
...> ORDER BY rating DESC
...> LIMIT 1;
It's a Wonderful Life