2016 twenty-four merry days of Perl Feed

Warm and Fuzzy Inside

Text::Fuzzy - 2016-12-18
  He knows when you're asleep
  He knows when you're awake
  He knows when you've been bad or good
  So be good for goodness sake

Sounds pretty nefarious, right? Jolly old elf indeed. Just how does Nick know so much about everyone? Truth be told, he has many proprietary mechanisms of collecting data in house but as is so often the case with large endeavours, he subcontracts much of the work out to other parties. In this case, he purchases commercial data sets from market research firms which you would think would lighten the load on his data mining elves however most of his contractors aren't as rigorous about data cleaning and normalization as they are.

Fortunately, the North Pole Crew have some experience with imprecise matching and due to Santa's poor spelling. In fact, earlier this month we saw an article on how they were testing modules for calculating Levenshtein distance - the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other - to help work out how close the two strings are together. With this could the Elves splice their data and the commercial data sets together?

Bad Santa

The elves did some testing and sadly it didn't look good. The problem the Elves were facing is that Levenshtein distance alone wasn't being smart enough for them. Consider the following corruptions:

Original Corrupted Levenshtein Distance
santa snata 2
santa sandy 2

In the first example the an from Santa's name simply got swapped to become na, whereas the second example requires two whole new characters replaced two old characters. However: these two examples have identical Levenshtein distance scores - how could the elves get their code to prioritize one over the other?

Enter Text::Fuzzy

Text::Fuzzy is a fuzzy text matcher that can not only calculate Levenshtein distance, but is also able to calculate Damerau-Levenshtein distance - counting character swaps as just one step. This gives us this much better looking result.

Original Corrupted Damerau-Levenshtein Distance
santa snata 1
santa sandy 2

As a bonus the interface for Text::Fuzzy is also surprisingly simple:

say Text::Fuzzy->new( $corrupted, trans => 1 )
               ->distance( $original );

(Note the use of the trans option there to enable Damerau-Levenshtein scores.)

Having Text::Fuzzy pick the one of the closest matches from a list is also really easy:

say scalar Text::Fuzzy->new('blixen')->nearestv([qw(
   dasher dancer prancer vixen comet
   cupid donner blitzen rudolph
)
]);

Putting It All Together

With this tool in in mind, the Elves knocked together a proof-of-concept for combining the two data sets. The code takes two disparate CSVs and combines them into a single CSV - fuzzily!

#!perl -l
use Text::Fuzzy;
use Text::xSV;
use Inline::Files -backup;
use List::Util 'max';

# Minimum match level
my $threshold = 85;

# Configure to read the two CSVs embedded at the bottom of the
# example code. One representing the Polar data and another
# representing the third party vendor data
my $PDB = Text::xSV->new(fh=>*POLARDB); $PDB->read_header();
my $VDB = Text::xSV->new(fh=>*VENDORDB); $VDB->read_header();

# Print combined header row
my @phead, @vhead;
print join(',', @phead=@{$PDB->{row}}, 'Match', @vhead=@{$VDB->{row}} );

# Load all the vendor data into memory
my @VDB;
push(@VDB, $_) while( $_ = $VDB->fetchrow_hash() );

# Process all the polar CSV lines and look for something we're reasonably
# confident is a match
while( my $polarRecord = $PDB->fetchrow_hash() ){
  my $key = serializePolarKey($polarRecord);
  my $keyLen = length($key);
  $TF = Text::Fuzzy->new($key, trans=>1 );

  my @matches;
  foreach my $vendorRecord ( @VDB ){
    my $dist = $TF->distance( serializeVendorKey($vendorRecord) );
    my $confidence = int( 100 * ($keyLen-$dist)/$keyLen);
    push @matches, $confidence;

# Alternatively, stash match, and only return highest match after loop?
print join(',',
               @{$polarRecord}{@phead},
               $confidence,
               @{$vendorRecord}{@vhead} ) if $confidence >= $threshold;
  }
  unless( max(@matches) >= $threshold ){
    print join(',', @{$polarRecord}{@phead}, max(@matches));
  }
}

sub serializePolarKey{
  return sprintf("%s %s %s,%s, %s, %s %s, %s",
                 $_[0]->{FirstName},
                 $_[0]->{MiddleName},
                 $_[0]->{LastName},
                 $_[0]->{Street},
                 $_[0]->{City},
                 $_[0]->{Province},
                 $_[0]->{PostCode},
                 $_[0]->{Country} );
}

sub serializeVendorKey{
  return $_[0]->{FullName} . ','. $_[0]->{Address};
}

__POLARDB__
ID,FirstName,MiddleName,LastName,Street,City,Province,PostCode,Country
1,John,Jacob,Jingleheimer-Schmidt,1320 Needham Drive,Centreville,NV,89119,USA
2,Bartholomew,Jay,Simpson,1322 Evergreen Terrace,Springfield,,,USA
3,Marie Claire,,Dubois,62 Clos du Bois Rossia,Namur,,5017,BE
__VENDORDB__
FullName,Address,ExtraData
Jon Jacob Jingleheimmer Schmitt,"1320 Needham Dr, Centerville, NV 89119, USA",Tm90aGluZyB0byBzZWUgaGVyZS4=
Bart Simpson,"1322 Evergreen Ter., Springfield USA",SSdtIGdldHRpbicgbm91dHRpbicgZm9yIENocmlzdG1hcw==
Mary Claire Dubois,"62 Clos duBois Russia,Namur,,5017,BE",dGl0aSB0b3RvIHR1dHUgdGF0YQ==
Marie Claire du Bois,"26 Clos du Bois Rossia,Namur,,5017,BE",SidhaW1lIGJpbmUgbGVzIGhpc3RvaXJlcyBkZSBHYXN0b24gTGFnYWZmZQ==

Output

  ID,FirstName,MiddleName,LastName,Street,City,Province,PostCode,Country,Match,FullName,Address,ExtraData
  1,John,Jacob,Jingleheimer-Schmidt,1320 Needham Drive,Centreville,NV,89119,USA,89,Jon Jacob Jingleheimmer Schmitt,1320 Needham Dr, Centerville, NV 89119, USA,Tm90aGluZyB0byBzZWUgaGVyZS4=
  2,Bartholomew,Jay,Simpson,1322 Evergreen Terrace,Springfield,,,USA,71
  3,Marie Claire,,Dubois,62 Clos du Bois Rossia,Namur,,5017,BE,85,Mary Claire Dubois,62 Clos duBois Russia,Namur,,5017,BE,dGl0aSB0b3RvIHR1dHUgdGF0YQ==
  3,Marie Claire,,Dubois,62 Clos du Bois Rossia,Namur,,5017,BE,85,Marie Claire du Bois,26 Clos du Bois Rossia,Namur,,5017,BE,SidhaW1lIGJpbmUgbGVzIGhpc3RvaXJlcyBkZSBHYXN0b24gTGFnYWZmZQ==
Gravatar Image This article contributed by: Jerrad Pierce <jpierce@cpan.org>