Where Does He Get Those Wonderful Codepoints?
𝔖𝔱𝔲𝔭𝔦𝔡 𝔘𝔫𝔦𝔠𝔬𝔡𝔢 𝔗𝔯𝔦𝔠𝔨𝔰
I got interested in understanding how Unicode works for two reasons: I needed to understand it to do my job, and it gave me an improved ability to do stupid things on Twitter. Mostly, today I want to talk about the latter.
So, for example, how did I make that stupid section header above? (By the way, if you can't read it, you should probably go install a more complete Unicode font, like Symbola!) Well, I knew that Unicode had Fraktur characters, and if I knew how they were named, I could start from there, so I used the uni command provided with App::Uni:
    $ uni fraktur
This is a stupidly useful tool for being a goofball. Did you know that I mostly use my programming skills to make dumb jokes on the internet? Yeah, this is your surprised face: 😐
    $ uni neutral
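If you don't have uni handy, the basic idea (scan every code point and keep the ones whose name contains your search term) can be roughed out in a few lines of Python. This is only a sketch of the idea and not how App::Uni is implemented; find_by_name is a made-up helper, and the results depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def find_by_name(fragment):
    """Yield (code point, character, name) for every name containing fragment.

    Hypothetical helper for illustration; matching is a plain
    case-insensitive substring test against the character name.
    """
    needle = fragment.upper()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for code points without a name
        if needle in name:
            yield cp, ch, name


if __name__ == "__main__":
    for cp, ch, name in find_by_name("fraktur"):
        print(f"{ch} - U+{cp:05X} {name}")
```

Swap "fraktur" for "neutral" (or whatever your joke needs) to search for something else.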
The 𝑁𝑒𝑥𝑡 Level
So, uni is great for finding characters quickly by name. Sometimes, though, you want to find characters based on other criteria. For example, when I see somebody trying to use m{\d} to match ASCII digits, I want to give them an example of some of the things that they really don't think should be matched.
Tom Christiansen is one of Perl's chief Unicode gurus, and he's written a bunch of weird and useful tools. brian d foy packaged those up as Unicode::Tussle, and now it's easy for anybody to install them. One of the tools is unichars, which lets you find characters based on many more criteria than their names. For example, to enlighten the guy using m{\d}, I want to find digits that aren't in [0-9], and I'm going to pick the seven from each script, because seven is a funny number:
    $ unichars '\d' '$_ !~ /[0-9]/' 'NAME =~ /SEVEN/'
I just specified my three criteria as arguments to the command:
a digit: \d
not in [0-9]: $_ !~ /[0-9]/
seven: NAME =~ /SEVEN/
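The same three criteria can be sketched in Python, whose \d also matches every Unicode decimal digit when the pattern is a str. This is a stand-in for illustration and not what unichars does internally; non_ascii_sevens is a hypothetical name, and the exact list depends on your Python's Unicode data.

```python
import re
import sys
import unicodedata

DIGIT = re.compile(r"\d")            # any Unicode decimal digit
ASCII_DIGIT = re.compile(r"[0-9]")   # only the ten ASCII digits


def non_ascii_sevens():
    """Return characters that match \\d, are not in [0-9], and are named SEVEN."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if not DIGIT.fullmatch(ch):               # criterion 1: a digit
            continue
        if ASCII_DIGIT.fullmatch(ch):             # criterion 2: not in [0-9]
            continue
        if "SEVEN" in unicodedata.name(ch, ""):   # criterion 3: seven
            hits.append(ch)
    return hits


if __name__ == "__main__":
    for ch in non_ascii_sevens():
        print(f"{ch} - U+{ord(ch):05X} {unicodedata.name(ch)}")
```

Every character it prints is something a bare \d will happily accept as a digit.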
You could also replace that second rule with, say, '\P{ASCII}'. There's more than one way to do it. There are some important things to know, though. By default, for example, unichars won't search the Supplementary Multilingual Plane or any of the other so-called "astral" planes (everything above U+FFFF). That means that this vital search will fail:
    $ unichars 'NAME =~ /WEARY/'
You meant:
    $ unichars -a 'NAME =~ /WEARY/'
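Why does the plane matter here? A quick Python check (again a stand-in, not part of Unicode::Tussle) shows where the weary cat lives. A code point's plane is its value shifted right by 16 bits, so plane 0 is the Basic Multilingual Plane and anything higher is "astral". plane_of is a hypothetical helper name.

```python
import unicodedata


def plane_of(ch):
    """Return the Unicode plane number (0 to 16) of a single character."""
    return ord(ch) >> 16


if __name__ == "__main__":
    cat = unicodedata.lookup("WEARY CAT FACE")
    print(f"{cat} is U+{ord(cat):04X}, plane {plane_of(cat)}")
    print(f"A is U+{ord('A'):04X}, plane {plane_of('A')}")
```

The cat sits above plane 0, which is why the search only finds it once the astral planes are included.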
Also, the sophistication of unichars comes at a price: speed. Searching for that weary cat face with uni takes about 0.144s on my laptop. Searching with unichars, 14.899s.
When you're just trying to make a stupid joke on Twitter, those 14 seconds aren't worth spending. On the other hand, when you need to actually search for characters matching certain criteria, unichars can do it, and uni can't.
Two More Tools
Unicode::Tussle comes with a bunch more tools, but I'll just show you two more, briefly. uninames is a bit more like uni in that its primary job is to search character descriptions, but its searches aren't limited to character names. It looks at the whole description. For example:
    $ uninames face
We get SURFACE INTEGRAL because it matches /face/i, but why do we get SEGMENT or WATCH? It's because they're related to characters with "face" in their descriptions. Sometimes, you'll get a match against a comment about the characters' usage:
    $ uninames fraktur
Finally, there's uniprops. This is a wonderfully useful tool, in very limited circumstances. Most often, for me, it comes up when I've got some weird input. Say some user's name isn't working with some data validator. The guy who wrote that validator required names to be (for some insane reason) Latin letters that were either uppercase or lowercase. Our other input validator only has the first half of that constraint: Latin letters. What's happening? Well, the user causing us problems is going by the name “ᴿᴶᴮˢ” – what can uniprops tell us about those?
    ~$ uniprops <1d3f>
So, it's Latin, a Letter, and Lowercase, but not a Lowercase_Letter. What?? Surely this is an anomaly..? We can find out with unichars!
    $ unichars '\p{Letter}' '\P{Lowercase_Letter}' '\P{Uppercase_Letter}' '\p{Latin}'
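For a second opinion on that name, here's a small Python sketch that prints each character's general category. It's a rough stand-in for uniprops, not a replacement: describe is a made-up helper, Python's unicodedata doesn't expose the script property (so the Latin part isn't checked), and I'm assuming that str.islower() follows the broader Lowercase property rather than the Lowercase_Letter category, so treat that column as a hint only.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, islower(), name) per character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),   # e.g. Lu, Ll, Lm
            ch.islower(),               # assumption: tracks the Lowercase property
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows


if __name__ == "__main__":
    for row in describe("ᴿᴶᴮˢ"):
        print(*row, sep="  ")
```

The category column is the one to watch: these are modifier letters (Lm), so they count as letters without being Lowercase_Letter (Ll) or Uppercase_Letter (Lu).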
What's the lesson here? You and the other guy should learn what those categories mean. Once you've started doing that, you'll be well on your way down the rabbit hole, and you'll start to find all new uses for these tools… but none is likely to be as much fun as making stupid Tweets.