2015 Twenty Four Merry Days of Perl

Building Santa's Naughty and Nice List with Stepford

Stepford - 2015-12-16

It's a little-known fact that Santa's elves are the ones responsible for producing his yearly naughty and nice list. But working on the list has been taking up time that they'd rather use for drinking pine juice and playing Dark Souls. They have a crufty Makefile, but it doesn't do a great job of rebuilding things when dependencies change, so they're constantly finding output errors and having to delete old files. It also doesn't play all that nicely with the Perl code they wrote to do the real work.

So the elves pooled their money and hired me to automate building the list. Looking at how they'd built the list before, I realized that Stepford was the perfect tool for the job!

What is Stepford?

Stepford is a tool that takes a set of steps (tasks), figures out their dependencies, and then runs them in the right order to get the result that you ask for. The result itself is just another step that you specify when creating the Stepford::Runner object. Steps are Perl classes built using Moose.

Dependencies and Productions

The "big thing" that Stepford does for you is to figure out the dependencies needed to get to the final step. It does this by looking at the dependencies and productions of all your steps and then running those steps in the necessary order.

Both dependencies and productions are declared as Moose attributes with a special trait. Here's an example:



has geolite2_database_file => (
    traits => ['StepDependency'],
    is => 'ro',
    isa => File,
    required => 1,
);

has ip_scores_file => (
    traits => ['StepProduction'],
    is => 'ro',
    isa => File,
    lazy => 1,
    builder => '_build_ip_scores_file',
);

 

You'll see how to actually populate the ip_scores_file later.

Stepford matches a production to a dependency solely by name, which means that attribute names for productions and dependencies must be unique to a given set of steps.
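To make the name matching concrete, here's a rough sketch in plain Perl. This is not Stepford's actual code. The %steps data structure and the production_map and providers_for functions are made up for illustration, and treating a duplicate name as a fatal error is my assumption about the simplest way to enforce uniqueness within one set of steps.

```perl
use strict;
use warnings;

# Hypothetical, simplified step descriptions: each step lists the names of
# the attributes it produces and the names of the ones it depends on.
my %steps = (
    'NN::Step::Children' => {
        productions  => ['children_file'],
        dependencies => [],
    },
    'NN::Step::AssignUUIDs' => {
        productions  => ['children_with_uuids_file'],
        dependencies => ['children_file'],
    },
);

# Build a map from production name to the step that produces it. A name
# claimed by two steps is ambiguous, so this sketch treats it as an error.
sub production_map {
    my (%steps) = @_;
    my %producer_of;
    for my $step ( sort keys %steps ) {
        for my $name ( @{ $steps{$step}{productions} } ) {
            die "$name is produced by both $producer_of{$name} and $step\n"
                if exists $producer_of{$name};
            $producer_of{$name} = $step;
        }
    }
    return %producer_of;
}

# For one step, return a hash of dependency name => step that provides it.
sub providers_for {
    my ( $step, %steps ) = @_;
    my %producer_of = production_map(%steps);
    return map { $_ => $producer_of{$_} } @{ $steps{$step}{dependencies} };
}

my %providers = providers_for( 'NN::Step::AssignUUIDs', %steps );
print "$_ comes from $providers{$_}\n" for sort keys %providers;
```

The only link between the two steps is the string children_file, which is why the attribute names matter so much.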

Step Classes

A "Step class" is any Moose class which consumes the Stepford::Role::Step role (or another role which in turn consumes that role). This role requires that a step class implement two methods, run and last_run_time. You'll see examples of both of these methods as we go further.

What Goes Into the Naughty and Nice List?

The elves gave me a long list of requirements, but honestly it all seemed like too much trouble. And since these elves are not very technically savvy, I'm going to take the easy route instead and just make some stuff up.

Here's what I'm going to do:

  • Get the names and IP addresses for all the children in the world, or at least a few of them.

  • Assign each child a UUID so I can track them easily.

  • Download the free GeoLite2 database from MaxMind.

  • Use the GeoLite2 database to look at each child's geographical location and use that to give their IP a naughty/nice score. This will be very scientific.

  • Look at each child's name and use that to give their name a naughty/nice score. Again, this will be very scientific.

  • Combine the IP and name scores into a single score per child and generate a text file with the naughty/nice list.

Here's a graph of the steps showing each step's dependencies:

Looking at this graph, you can see a couple of interesting things. First, there are two steps, "Get list of children" and "Download GeoLite2 databases", with no dependencies. Next, there are steps that are dependencies for more than one other step, "Assign UUIDs" and "Get list of children". Finally, the "Combine scores" step has three dependencies but is not a dependency of any other step.

Figuring all this stuff out is what Stepford is for. In fact, it calculates a graph just like this internally.
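If you're curious what "running steps in the right order" amounts to, here's a minimal sketch of a depth-first walk over a dependency graph. It's an illustration of the general technique, not Stepford's implementation. The short step names and the exact edges in %depends_on are my guesses based on the description above (only Children, AssignUUIDs, and WriteList appear as class names in this article), and run_order and _visit are hypothetical helper names.

```perl
use strict;
use warnings;

# Step => the steps it depends on. These edges are assumed for illustration
# and may not match the real graph exactly.
my %depends_on = (
    Children    => [],
    GeoLite2    => [],
    AssignUUIDs => ['Children'],
    IPScores    => [ 'AssignUUIDs', 'GeoLite2' ],
    NameScores  => ['AssignUUIDs'],
    WriteList   => [ 'AssignUUIDs', 'IPScores', 'NameScores' ],
);

# Return every step needed to reach $final, with each step listed after all
# of the steps it depends on.
sub run_order {
    my ( $final, $graph ) = @_;
    my ( @order, %state );
    _visit( $final, $graph, \%state, \@order );
    return @order;
}

# Depth-first visit: handle a step's dependencies before the step itself.
sub _visit {
    my ( $step, $graph, $state, $order ) = @_;
    my $seen = $state->{$step} // '';
    return if $seen eq 'done';
    die "Dependency cycle involving $step\n" if $seen eq 'visiting';

    $state->{$step} = 'visiting';
    _visit( $_, $graph, $state, $order ) for @{ $graph->{$step} // [] };
    $state->{$step} = 'done';
    push @$order, $step;
    return;
}

print join( ' -> ', run_order( 'WriteList', \%depends_on ) ), "\n";
```

Steps that the final step doesn't need, directly or indirectly, never show up in the result, which matches the idea of asking for a final step and letting the tool work backwards.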

Building our First Step

Let's start by building the step to "Get list of children". All the step classes for a single set of steps should live under the same namespace. I'm going to use NN::Step as our namespace prefix.



package NN::Step::Children;

use strict;
use warnings;
use autodie;
use experimental 'signatures';

use Data::GUID;
use MooseX::Types::Path::Class qw( Dir File );
use Text::CSV_XS;

use Moose;

with 'Stepford::Role::Step::FileGenerator';

no warnings 'experimental::signatures';

has root_dir => (
    is => 'ro',
    isa => Dir,
    coerce => 1,
    default => '.',
);

has children_file => (
    traits => ['StepProduction'],
    is => 'ro',
    isa => File,
    lazy => 1,
    builder => '_build_children_file',
);

sub run ($self) {
    my $file = $self->children_file;

    $self->logger->info("Writing names and IPs to $file");

    my $data = do {
        local $/;
        <DATA>;
    };

    # CSV line ending per http://tools.ietf.org/html/rfc4180
    $data =~ s/\n/\r\n/g;
    $file->spew($data);
}

sub _build_children_file ($self) {
    return $self->root_dir->file('children.csv');
}

__PACKAGE__->meta->make_immutable;

1;

__DATA__
"Alexander Marer",42.235.92.147
"Andrew Bernard Cray",205.145.143.62
...

 

Let's look at the interesting bits more closely.



with 'Stepford::Role::Step::FileGenerator';
 

All step classes must consume one of the Step roles provided by Stepford. This particular role tells Stepford that all of this step's outputs are in the form of files. This lets Stepford calculate the step's last run time by looking at the files' modification times. For non-file steps, you have to provide a last_run_time method of your own.



has root_dir => (
    is => 'ro',
    isa => Dir,
    coerce => 1,
    default => '.',
);

has children_file => (
    traits => ['StepProduction'],
    is => 'ro',
    isa => File,
    lazy => 1,
    builder => '_build_children_file',
);

 

This class has two attributes. The root_dir attribute is neither a dependency nor a production. You'll see how to set this attribute later on. The children_file attribute is a production. Some other steps will depend on this production.



sub run ($self) {
    my $file = $self->children_file;

    $self->logger->info("Writing names and IPs to $file");

    my $data = do {
        local $/;
        <DATA>;
    };

    # CSV line ending per http://tools.ietf.org/html/rfc4180
    $data =~ s/\n/\r\n/g;
    $file->spew($data);
}

 

Every Step class must provide a run method. This method is expected to do whatever work the step does. In this case I take the list of children in DATA and turn it into a CSV file.

The logger attribute is provided to each step by the Stepford::Runner class. You'll learn more about that class later.

Atomic File Steps

I could have used Stepford::Role::Step::FileGenerator::Atomic instead. If your step is writing a file, using this role will prevent you from leaving behind a half-finished file if the step dies. I didn't use it in my example code just to keep the code simpler, but I highly recommend it for production code.
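The general technique behind an atomic file write is simple enough to sketch with core modules. I'm not claiming this is how the Atomic role does it internally. It's the usual write-to-a-temp-file-then-rename pattern, and write_atomically is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Basename qw( dirname );
use File::Temp qw( tempdir );

# Write $content to $path so that a half-written file never appears at
# $path. The data goes to a temp file in the same directory, which is then
# renamed over the target.
sub write_atomically {
    my ( $path, $content ) = @_;

    # The temp file is removed automatically if we die before the rename.
    my $tmp = File::Temp->new( DIR => dirname($path) );
    binmode $tmp;
    print {$tmp} $content or die "Cannot write to $tmp: $!";
    close $tmp or die "Cannot close $tmp: $!";

    rename $tmp->filename, $path
        or die "Cannot rename $tmp to $path: $!";

    # The temp name no longer exists, so there is nothing left to clean up.
    $tmp->unlink_on_destroy(0);
    return;
}

# Example use, writing into a throwaway directory.
my $dir = tempdir( CLEANUP => 1 );
write_atomically( "$dir/children.csv", qq{"Alexander Marer",42.235.92.147\r\n} );
```

Creating the temp file in the target's own directory keeps both names on the same filesystem, which is what lets rename replace the target in a single operation.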

More Steps

The other steps are pretty similar. They take some data and spit something new out. Let's take a look at some of the code from the step that adds the UUIDs:



package NN::Step::AssignUUIDs;

...

has children_file => (
    traits => ['StepDependency'],
    is => 'ro',
    isa => File,
    required => 1,
);

has children_with_uuids_file => (
    traits => ['StepProduction'],
    is => 'ro',
    isa => File,
    lazy => 1,
    builder => '_build_children_with_uuids_file',
);

 

This step depends on the children_file created by the Children step. Stepford will figure this out and make sure that the steps are run in the correct order.

The AssignUUIDs step in turn has its own StepProduction which future steps will depend on.

The remaining steps follow a similar pattern. They take an input file and produce an output file. The last step, WriteList, is a little different, so let's see how:



package NN::Step::WriteList;

use Moose;

with 'Stepford::Role::Step';

 

The first difference is that I'm consuming the Stepford::Role::Step role instead of Stepford::Role::Step::FileGenerator.

This is mostly so I can demonstrate how to write a last_run_time method.



has children_with_uuids_file => (
    traits => ['StepDependency'],
    is => 'ro',
    isa => File,
    required => 1,
);

has ip_scores_file => (
    traits => ['StepDependency'],
    is => 'ro',
    isa => File,
    required => 1,
);

has name_scores_file => (
    traits => ['StepDependency'],
    is => 'ro',
    isa => File,
    required => 1,
);

 

This step has three dependencies, unlike the previous steps you've seen. Each of these dependencies comes from a separate step. Stepford will figure all that out for us and run those steps before this one.

And here's the last_run_time method:



sub last_run_time ($self) {
    my $file = $self->_naughty_nice_list;
    return undef unless -e $file;

    return $file->stat->mtime;
}

 

This is pretty straightforward. If the file exists, I return its last modification time. If not, I return undef.

Stepford uses the value of each step's last_run_time to determine whether or not a given step needs to be run at all. If a step's own data is newer than the data from every step it depends on, there's no point in running that step again.
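Here's how I'd sketch that decision in plain Perl. This is my reading of the rule rather than Stepford's own code. The needs_run function is hypothetical, times are epoch seconds, and the sketch assumes every dependency has already run, so its time is defined.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Decide whether a step has to run. $last_run is the step's own
# last_run_time (undef if it has never run or its output is missing).
# @dep_times are the last run times of the steps it depends on.
sub needs_run {
    my ( $last_run, @dep_times ) = @_;

    return 1 unless defined $last_run;    # nothing to reuse
    return 0 unless @dep_times;           # no inputs that could be newer

    # Run again only if some input changed after this step last ran.
    return max(@dep_times) > $last_run ? 1 : 0;
}

printf "never run: %d\n",        needs_run( undef, 100 );
printf "inputs are older: %d\n", needs_run( 200, 100, 150 );
printf "an input is newer: %d\n", needs_run( 120, 100, 150 );
```

Returning undef from last_run_time when the file is missing is what lands a step in the first case.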

(By the way, the last_run_time method above is essentially the same as the one in Stepford::Role::Step::FileGenerator.)

Running Your Steps

Now that I've written my steps, how do I run them? Here's the script I wrote:



#!/usr/bin/env perl

use strict;
use warnings;

use FindBin qw( $Bin );
use lib "$Bin/../lib";

use Getopt::Long;
use Log::Dispatch;
use Stepford::Runner;

sub main {
    my $debug;
    my $jobs;
    my $root;

    GetOptions(
        'debug' => \$debug,
        'jobs:i' => \$jobs,
        'root:s' => \$root,
    );

    my $logger = Log::Dispatch->new(
        outputs => [
            [
                'Screen',
                newline => 1,
                min_level => $debug ? 'debug' : 'warning',
            ]
        ]
    );

    Stepford::Runner->new(
        step_namespaces => 'NN::Step',
        logger => $logger,
        jobs => $jobs // 1,
        )->run(
        config => { $root ? ( root_dir => $root ) : () },
        final_steps => 'NN::Step::WriteList',
        );

    exit 0;
}

main();

 

The only interesting piece is my use of Stepford::Runner.



Stepford::Runner->new(
    step_namespaces => 'NN::Step',
    logger => $logger,
    jobs => $jobs // 1,
    )->run(
    config => { $root ? ( root_dir => $root ) : () },
    final_steps => 'NN::Step::WriteList',
    );

 

The Stepford::Runner constructor takes several named arguments. The step_namespaces argument tells Stepford under what namespace it should look for steps. It will load all the classes that it finds under this namespace.

You can pass multiple namespaces as an array reference. When two steps have a production of the same name, then the step that comes first in the list of namespaces wins. This is useful for testing, as it lets you mock as many steps as you need to.
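The "first namespace wins" rule can be sketched in a few lines. Again, this is an illustration and not Stepford's code. The resolve_productions function, the Test::Step namespace, and the NN::Step::GeoLite2 and Test::Step::FakeGeoLite2 class names are all hypothetical.

```perl
use strict;
use warnings;

# $namespaces is an ordered list of namespace names. $productions_in maps
# each namespace to a hash of production name => step class that makes it.
sub resolve_productions {
    my ( $namespaces, $productions_in ) = @_;
    my %producer_of;
    for my $ns (@$namespaces) {
        my $found = $productions_in->{$ns} // {};
        for my $name ( keys %$found ) {
            # Earlier namespaces take priority, so never overwrite an entry.
            $producer_of{$name} //= $found->{$name};
        }
    }
    return \%producer_of;
}

my $resolved = resolve_productions(
    [ 'Test::Step', 'NN::Step' ],
    {
        'NN::Step' => {
            children_file          => 'NN::Step::Children',
            geolite2_database_file => 'NN::Step::GeoLite2',
        },
        'Test::Step' => {
            geolite2_database_file => 'Test::Step::FakeGeoLite2',
        },
    },
);

print "$_ => $resolved->{$_}\n" for sort keys %$resolved;
```

With the test namespace listed first, the fake download step replaces the real one while every other production still comes from NN::Step.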

The logger can be any object that provides a certain set of methods (debug, info, etc.).

Finally, if you set jobs to a value greater than one, Stepford will run steps in parallel, running up to $jobs steps at once whenever possible.

The call to the run method also accepts named arguments. Keys in the config argument which match constructor arguments for a step will be passed to that step class as the step is constructed. Remember way back up above when I mentioned that I'd show you how to set the root_dir attribute of the NN::Step::Children class? This is how you do that.
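Conceptually, matching config keys to a step's constructor arguments is a filter over a hash. Here's a small sketch of that idea. The args_for_step function and the explicit list of accepted names are assumptions for illustration; Stepford works out which keys a step accepts on its own.

```perl
use strict;
use warnings;

# Return only the config entries whose keys appear in @accepted, the list
# of constructor argument names that a step takes.
sub args_for_step {
    my ( $config, @accepted ) = @_;
    return map { $_ => $config->{$_} }
        grep { exists $config->{$_} } @accepted;
}

my %config = ( root_dir => '/tmp/nn', unrelated_key => 42 );

# The Children step accepts root_dir and children_file, but the config only
# supplies root_dir, so that is the only argument passed along.
my %args = args_for_step( \%config, qw( root_dir children_file ) );

print "$_ = $args{$_}\n" for sort keys %args;
```

Keys that no step accepts are simply left out, so one config hash can serve every step in the run.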

The final_steps argument can be a single step class name or an array reference of names. This is how you specify the result you're asking Stepford for.

Why Stepford?

Stepford is a lot like make, rake, and many other tools. Stepford was originally created to help improve our automation around building GeoIP databases at MaxMind.

I investigated make and rake, which are both great tools. However, what makes them shine is how they integrate with certain environments. The make tool is great if you're interacting with a lot of existing command line tools like compilers, linkers, etc. And of course rake is great if you're dealing with existing Ruby code.

But our database building code is written in Perl, so it made sense to write a tool in Perl.

If you're in a similar situation, with a Perl code base that executes a series of steps towards one or more final products, then Stepford might be a good choice for you as well.

It certainly worked well for those elves. Sure, the naughty and nice list they get is complete and utter nonsense, but it's a lot quicker to generate, giving them more time for their pine juice-fueled Dark Souls speedruns.

The Code

If you want to see all the step code for this article, check out this article's GitHub repo.


This article contributed by: Dave Rolsky <autarch@urth.org>