Bit vectors save space on Santa's list
The number of children in the modern world has strained Santa's meager on-sleigh systems. Back when he first started, he was consulting a scroll the elf handed him as he left the North Pole. When the scroll got too unwieldy, an elf intern created a small Naughty/Nice (NN) database that Santa could install on the ancient navigation system, but space was running tight since Santa had once been told that 640k should be enough for anyone (someone is permanently on the Naughty list). He had plans for upgrades, but there was always something more important. The bar code scanner subsystem he added had cut down on mis-gifting (ever got a present you didn't expect?), but that didn't leave enough space for the NN database. Something needed to happen.
There was an ancient elf, long since retired, who had an idea, and Santa brought him out of retirement for a special assignment. This was a make-or-break effort; mess this up and Santa might have to cut out the parts of the world he couldn't fit into the meager memory he had left. "Greybeard Elf," Santa said, "you have to save Christmas." Greybeard Elf responded "Again?" There's a reason these elves retire.
Greybeard Elf had fixed all the COBOL programs on December 24, 1999, so Santa was expecting big things. Greybeard Elf corrected Santa: "No, you expect small things, and I'm going to use the smallest of small things."
Since every present was already bar-coded and every child had been assigned a special Santa Number, the trick was to see whether that number was Naughty or Nice. Santa doesn't have time for anything fancy. Just tell him yes or no.
Greybeard Elf, versed in the ways of C and Perl, did everything with bit fiddling even before the definition of the byte standardized on 8 bits. If he represented every child as a single bit, he could do it all in about 300 MB on an SD card he could put in a Raspberry Pi he would stick to the sleigh. "Green means Nice, red means Naughty", Greybread Elf said.
Perl, in particular, makes bit fiddling easy through `vec`, a built-in function to manipulate regular strings as if they were bit fields. It lets you set the width of the group of bits (in powers of two up to 64), then access it similarly to a list.
Greybeard Elf showed Santa an example of setting a single bit:
use v5.42;
my $bitmap;
my $WIDTH = 1;
my $pos = 13;
vec( $bitmap, $pos, $WIDTH ) = 1;
say show_bitmap($bitmap);
sub show_bitmap ($b) {
my $bits = 8 * length $b;
my $s = join '',
map { vec($b, $_, $WIDTH) ? '+' : '.' }
0 .. ($bits - 1);
}
This set the 13th bit, extending the string as necessary to access a bit position. The output was a visual representation of the entire bit string, with a dot representing 0 and a plus sign representing 1. The 12th bit, counting from 0, is on (true, 1, whatever):
.............+..
Reading the bit field is just as easy. If you don't assign to `vec`, you get the answer back:
my $nice = vec($b, $pos, $WIDTH);
Short and simple. No need for tricky services or programs. Santa said "Okay, whatever, I just need it to work!" in the way managers try to get out of these sorts of conversations quickly.
Greybeard Elf showed off another trick. These are just strings, so you can change characters in that string. What's the difference between "perl" and "Perl"? It's just one bit:
{
my $bitmap = "perl";
say show_bitmap($bitmap);
vec($bitmap, 5, 1) = 0; # bits are reversed within the byte
say $bitmap; # "Perl"
}
And, this even works on memory-mapped files. For example, you can start with a file that holds a string that is wide enough for all the bits that you might want to set. The string could be all null bytes (which is different than the character "0", which has set bits). Once you have the file, you can then read or set bits, and whatever you do is persistent:
use v5.36;
use File::Map qw(map_file);
my $FILENAME = 'naughty-nice.db';
my $BITS = 56;
{ # this is just to set up the "pre-existing" file
open my $fh, '>', $FILENAME;
print {$fh} "\000" x ($BITS/8);
close $fh;
}
map_file my $bitmap, $FILENAME, '+<'; # file must exist
foreach my $i ( random_numbers($BITS, 13) ) {
vec( $bitmap, $i, 1 ) = 1;
printf "%3d: %s\n", $i, show_bitmap($bitmap);
}
sub random_numbers ($max = $BITS, $n = 10) {
my @a = map { int rand $max } 1 .. $n;
}
sub show_bitmap ($b) {
my $bits = 8 * length $b;
my $s = join '', map { vec($b, $_, 1) ? '+' : '.' } 0 .. $bits - 1;
}
The output shows the bit vector after each bit operation. Note that some bit positions (10, 14, 43) are "set" twice, which just means they remain set. You can only be Nice once, no matter how nice you are:
27: ...........................+............................
54: ...........................+..........................+.
43: ...........................+...............+..........+.
12: ............+..............+...............+..........+.
45: ............+..............+...............+.+........+.
10: ..........+.+..............+...............+.+........+.
43: ..........+.+..............+...............+.+........+.
49: ..........+.+..............+...............+.+...+....+.
35: ..........+.+..............+.......+.......+.+...+....+.
14: ..........+.+.+............+.......+.......+.+...+....+.
10: ..........+.+.+............+.......+.......+.+...+....+.
14: ..........+.+.+............+.......+.......+.+...+....+.
17: ..........+.+.+..+.........+.......+.......+.+...+....+.
But that's not all. Greybeard Elf wants to know Nice children count. You could test each bit individually, but there's a trick with `unpack` that counts the set bits. The `%` provides a checksum of the value (and the `*` is just a repetition of the field):
say "Unique numbers seen: " . count_bits($bitmap);
sub count_bits ($b) { unpack("%32b*", $b) }
I (the author) adapted this example from a particular task I had. Working in a shared container provided to me in which I had no special privileges, I had to count unique IPv4 addresses found in a series of PCAP files. There can be about 3.7 billion of those, excluding the private or reserved addresses. Accumulating these in a hash not only created billions of keys, but each key was itself 4 (packed) bytes. That's a lot of memory, and it was a bit slow.
I only needed to know if a set of IP addresses was part of the data, and I didn't care at all about how many times. In a hash, this would use `exists` to check the key and ignore the value (so undef would be a fine value). With `vec`, I merely look at the bit position that matches the numeric value of the IPv4 address.
And, I could easily save the result and inspect it later without loading a big data file that I would have to parse. I did try freezing with `Storable`, but I knew that wasn't going to work. But it gave me something to do as the problem worked itself out in the back of my head. The output sizes were just too large, and I had too many files to create. With the bit string, roughly one file per every hour of the last four years, with each one file about 400 MB (about 10 GB a day, 4 TB a year).
Part of the problem is that Perl values aren't just values, but carry along all the baggage of Perl SVs, the underlying data structure that tracks all the magic and other things a scalar value knows about itself. This is many times larger than the size of a small value, such as an IPv4 address, and it's repeated for every value. Even an undefined value in an SV takes up quite a bit of space, and a short string takes up much more:
$ perl -MDevel::Size=total_size -E 'my $x; say total_size(\$x)'
24
$ perl -MDevel::Size=total_size -E 'my $x=q(1); say total_size(\$x)'
48
So, using a hash to count even a small fraction of the IP space takes up much more space, and counting a significant portion of it smashes the stack.
- Previous
- Next