2014 twenty-four merry days of Perl Feed

Finding CPAN distributions with github repositories

MetaCPAN::Client - 2014-12-15

Blink the elf wasn't happy, because Father C wasn't happy. Blink had been told to send presents to every CPAN author who released a new distribution (aka neocpanism) to CPAN this year, regardless of who had done the most recent release of the distribution. But the distribution had to still be on CPAN.

"Does SHARYANTO get one for every dist?!", Blink had asked, but got a cuff round the ear for that.

He thought he'd done a good job, but the Senior Elf had just been to tell him that some authors had missed out.

Father C doesn't like it when people don't get the presents they should.

What went wrong?

Blink had remembered that they have their own CPAN mirror at the North Pole, so he just used Path::Iterator::Rule to iterate over it, looking for the META.yml files which hold the metadata for each distribution. From those he pulled out the name key, which holds the distribution's name.

He'd remembered that they have a database of CPAN authors and their neocpanisms, so he'd used that get the other information he needed.

But on closer inspection, he discovered that not all distribution names from metadata files matched those in the database. "Sort it out, and send some pull requests!", the Senior Elf had chastised him.

Distribution names

When you create a module Foo::Bar, you would normally release it to CPAN as part of a distribution called Foo-Bar. You can specify the distribution name yourself, but distribution builders like ExtUtils::MakeMaker will produce the dist name for you, based on the module name. The distribution name ends up in the metadata file.

When you release a version of your distribution, the tarball usually has a name built up from the distribution name, the version, and whatever extension is used for the archiving method you used. So Foo::Bar 0.01 would probably be released in Foo-Bar-0.01.tar.gz. If Blink had released this, the location of this release on CPAN would be:

 B/BL/BLINK/Foo-Bar-0.01.tar.gz

These paths turn up everywhere in the CPAN ecosystem, so there's a module CPAN::DistnameInfo which takes a path and picks it apart. As a result, most parts of the ecosystem assume that the dist name inferred by CPAN::DistnameInfo is the right distribution name.

The problem can arise where the release name isn't based on the distribution name, which might be for a variety of reasons. Sometimes the filename has the right dist name, and it's the metadata that doesn't.

Finding the problem distributions

"Talk to Olaf", Senior Elf had told Blink. But after a frustrating ten minutes, Blink went back to SE, complaining that Olaf just sang about summer. "No, not Olaf the Snowman, I meant Olaf Alders!".

With Olaf's help, Blink wrote the following, to iterate over all CPAN distributions, using the MetaCPAN API:

use MetaCPAN::Client;
my $client = MetaCPAN::Client->new();
my $query = { all => [
                      { status => 'latest' },
                      { maturity => 'released' },
                 ]};
my $params = { fields => [qw/ metadata download_url /] };
my $result_set = $client->release($query, $params);

while (my $release = $result_set->next) {
}

The maturity => 'released' says that you don't want developer releases, and status => 'latest' says that you only want the latest release for each distribution.

You can look at the release data for a distribution using the API in a browser, for example https://api.metacpan.org/release/URI-Title. You'll see there's a lot of information. If you're only interested in some of it, you can just request specific fields (the $params hashref above, passed to the release method. The $release above is a hashref, which contains slices of the data you saw from the API.

The two bits of interest here are:

my $distname = $release->metadata->{name};
# "URI-Title"

my $path = $release->download_url;
# "http://cpan.metacpan.org/authors/id/B/BO/BOOK/URI-Title-1.89.tar.gz"

Notice that download_url is a direct accessor. The metadata accessor returns a hash reference, which you can drill into to get the specific bits of metadata you're after.

You can then give that $path to CPAN::DistnameInfo, and the dist method tells you what it thinks the distribution name is. So Blink could compare that with the name from the metadata file, which handily is also available:

use CPAN::DistnameInfo;
my $dinfo = CPAN::DistnameInfo->new($path);
push(@bad, $path) if $dinfo->dist ne $distname;

Handily, MetaCPAN has already done that conversion for you, and the distribution name according to CPAN::DistnameInfo is available in the distribution field from the API, and the accessor of the same name. So the above code becomes:

push(@bad, $path) if $release->distribution ne $release->metadata->{name};

So now Blink could build a list of problem distributions, but he'd been told to send some pull requests as penance. How could he find out which of the distributions were on github?

He found an article which explained how the github repository can be recorded in a distribution's metadata. If you look in that API output, you'll find it under the resources key.

You can use this fact to further constrain your MetaCPAN search, so that you only consider distributions with a github repo in their metadata:

my $query  = { all => [
                  { status => 'latest' },
                  { maturity => 'released' },
                  { 'resources.repository.url' => '*github*' },
             ]};

Blink showed his code to Olaf, who explained that when searching on MetaCPAN you can specify which fields you're interested in from the API, and you'll only get those. This results in a lot fewer bytes coming back over the wire.

They paired up to refactor Blink's code, and ended up with this:

use MetaCPAN::Client;
my $client = MetaCPAN::Client->new();
my $query = { all => [
                     { status => 'latest' },
                     { maturity => 'released' },
                     { 'resources.repository.url' => '*github*' }
                 ]};
my $params = { fields => [qw/ metadata distribution /] };
my $result_set = $client->release($query, $params);

while (my $release = $result_set->next ) {
    if ($release->distribution ne $release->metadata->{name}) {
        printf "Distribution '%s' has name '%s' in metadata\n",
               $release->distribution,
               $release->metadata->{name};
    }
}

Blink completed his task, and TEODESIAN and other authors got their gifts.

See Also

Gravatar Image This article contributed by: Neil Bowers <neil@bowers.com>