Hacker News
Understanding and avoiding visually ambiguous characters in IDs (gajus.com)
279 points by gajus 11 days ago | 200 comments





I had this exact situation at work when they shipped millions of devices with serial numbers, and didn't leave out any letter or number. Customers had so much trouble reading them accurately, I had to make a regex script that generated every possible typ0 permutation of what the customer said, and then it would list only matches from the factory database. From there, folks would try to correlate other info like dates to figure out what their real serial number probably was. It was a nightmare. Ironically several of the digits never changed, and some were just 0 1 or 2 to represent which factory made it, so there was no need for the entire character set in the first place. They seem to have been convinced we'd produce 8 quadrillion devices.

> They seem to have been convinced we'd produce 8 quadrillion devices.

While I'm not arguing that their decisions were wise, or that the issues they caused you and your colleagues weren't foreseeable and preventable, I would add this one thought in response to the line quoted:

It's often either beneficial, or at least considered beneficial, to prevent business information leaking through serial numbers. The simplest example: if you start labelling your products with 1, 2, 3.. and never deviate, then it's fairly easy to take a sample of not many serial numbers, estimate how high they go, and therefore how many have been sold. Sometimes it can also be beneficial to make it harder to guess a valid serial number (e.g. it prevents customers from pretending to have a valid one to get a refund, or whatever).

Of course, even if you have these concerns and want to mitigate them, it doesn't prevent you from also taking steps to prevent difficulty reading the correct characters. If anything it should make them more aware of the potential issues you faced since it means someone is already actually thinking specifically about what system to use, as opposed to what likely happened in your case of someone spending 30 seconds going "we need serial numbers, using X digits means we'll never run out, job done".


> if you start labelling your products with 1, 2, 3.. and never deviate, then it's fairly easy to take a sample of not many serial numbers and estimate how high they go and therefore how many have been sold.

Also known as the German Tank Problem[1].

https://en.wikipedia.org/wiki/German_tank_problem


I bought some software many years back. The serial number had 6 or so digits in it. At one point, I contacted the developer for some other purpose, and pointed out that I had made my purchase as soon as I heard about the product. He told me I was the first customer, and that he had decided to make up long serial numbers to avoid this counting problem.

It also works great as a checksum. See IBANs for a great example: the numeric form of a valid IBAN always leaves remainder 1 when divided by 97, which makes accidental typos much more likely to be caught.
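The mod-97 rule can be sketched in a few lines (using the well-known example IBAN that appears in ISO 13616 documentation):

```python
def iban_ok(iban: str) -> bool:
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) per ISO 13616.
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    # A valid IBAN leaves remainder 1 when taken mod 97.
    return int(digits) % 97 == 1

print(iban_ok("GB82 WEST 1234 5698 7654 32"))  # → True
```

A single mistyped character shifts the remainder away from 1, so the typo is caught before it reaches a bank.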

Come to think of it, I wonder if this is why (or a factor in why) Apple serial numbers don't have any vowels in them.

I think only consonants and digits are used in device serial numbers.


Encoding should also depend on the user. Base32 (Crockford & RFC 4648) has a nice unambiguous alphabet for compact representation, with an explanation of why. However, if your users are speaking aloud, you might want a word-list representation, “TIDE ITCH SLOW REIN RULE MOT”, like S/KEY's RFC 1751. DO NOT invent your own word lists; there are an infinite number of dragons lying in wait for idioms, homophones, dialects, etc. Don't be like me and unintentionally create a major incident like “wet clam butterfly.”

> However if your users are speaking aloud you might want a word list representation, “TIDE ITCH SLOW REIN RULE MOT”, like s/key rfc 1751. DO NOT invent your own word lists; there are an infinite number of dragons lying in wait for idioms, homophones

An unfortunate example. That's TIED HITCH SLOE REIGN RULE MOW? With only two parity bits, you can't even be sure this decoding is invalid.

RFC 1751 [0], from which this example comes, doesn't envisage the encoding being used in oral communication. Instead, it makes codes easier for the user to "read, remember, and type in".

For oral transmission among professionals, sticking to the 26 upper case letters and relying on the NATO alphabet for encoding is a reasonable choice. Getting codes from untrained users in a lossy oral environment is still an unsolved problem.

[0] https://datatracker.ietf.org/doc/html/rfc1751
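The professionals-only approach can be sketched as a tiny spelling helper (the NATO/ICAO words are standard; leaving digits as-is is my own assumption here):

```python
# NATO/ICAO phonetic alphabet for reading IDs aloud.
NATO = {
    "A": "Alfa", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "Xray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell(code: str) -> str:
    # Letters get their NATO word; anything else (digits) is read as-is.
    return " ".join(NATO.get(c, c) for c in code.upper())

print(spell("BX7"))  # → Bravo Xray 7
```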


It would help if the NATO alphabet were a universally known thing.

Spelling something out letter by letter in the Latin alphabet when neither party is a native English speaker is painful almost half the time it happens.


My personal experience says that the most commonly understood phonetic alphabet in the US among laypeople is the 1946 ARRL alphabet using American first and last names, for example A as in Adam, N as in Nancy. NATO phonetic alphabet confuses almost everyone I've tried it on.

https://en.wikipedia.org/wiki/Spelling_alphabet


Everyone I've run into in the hospitality industry gets NATO phonetic. Hotels and airlines, in my experience, but I assume it generalizes.

My wife thought it was crazy the first time she heard me use it. Then she realized that they all understand it too.


Not to mention accents.

Some people are going to sound like they’re saying Todd for Tide, and have you heard how Baltimore pronounces Iron?


Gotta cut it some slack since it’s from 1994, but still that’s a humorously bad RFC:

> These require use of a keyed message-digest algorithm, MD5 [Riv92] […] while sufficiently strong […]

Heh!

> […] is hard for most people to read, remember, and type in.

Ok, go on…

> English words are significantly easier for people to both remember and type.

Most people don’t know English.. But that shouldn’t be a problem since the word list can be changed. Right?

> Because of the need for interoperability, it is undesirable to have different dictionaries for different languages.

Oh. Well the world already learned the 26 characters of the English alphabet so adding a few words is probably fine..

> char Wp[2048][4] = […]

Oh, well at least it’s common words suitable for English beginners?

> WAD, BESS, MERT…

Hold on, these words are tricky even for…

> ORR? AGEE EGAN HAAS!!

…Are you done?

> GAUL FLAM! DRAB!


What's this type of ID called?


This brings up memories.

One day while sick, I distracted myself from being sick by writing up a silly module to do arithmetic in arbitrary bases. And, because it was easy I stuck it on CPAN. https://metacpan.org/pod/Math::Fleximal is the module.

Of all of the silly things I'd done, I would have sworn that this is the one that should never generate a support request. But it did! Why? Well I'd included a demonstration of how to turn hexadecimal into an alphanumeric code. And someone had the bright idea of using the same thing to turn long numbers into readable codes!

My module worked, but I was still a bit flabbergasted that THIS wound up in production somewhere!!


The author makes a point of avoiding letters that are hard to distinguish even when spelled out in handwriting, but the example table includes the number 7. I cannot count the number of times I have found it hard to distinguish between someone's 7 and 1.

It helps if you draw a horizontal bar on the 7 but many don't, so you can never really be sure if a 7 is in fact a 1 with the serif or vice versa.


I never ran into this situation, but I plan to update the article based on aggregated feedback. A few good suggestions have been made.

It might be based on the handwriting standards used in your country. Where I live we were taught at school to draw a horizontal bar on 7 and avoid the serif on 1:

https://is.mediadelivery.fi/img/468/a93c32e08dae4768869a4bda...

No chance of confusion. This seems to have prompted some to add the serif to their 1 for stylistic reasons or whatever, since it's still distinguishable from 7 with a bar.

But then again people following older or newer conventions drop the bar from their 7:

https://is.mediadelivery.fi/img/468/46827e3320294f89b12a9338...

This makes a singular 1 with sloppily drawn serif hard to distinguish from a 7 without horizontal bar unless you can also see how the same person draws the other digit in their style.


An alternative way, that makes the "1"s a bit less ambiguous, is to draw a bar at the bottom. So even if you put the serif on the 1, and write it sloppy, you still have the bar at the bottom.

See the last example in this image:

https://upload.wikimedia.org/wikipedia/commons/thumb/e/ee/Ha...

Side note to OP and author, the Wikipedia page is pretty handy and has a lot of info:

https://en.wikipedia.org/wiki/Regional_handwriting_variation


Where I grew up (Korea), we write 7 with an extra serif at the upper left corner, like this: https://pop.yesform.com/pop/16113

It never gets confused with 1, but in America, people were confusing it with 9 (!!), so I had to stop writing it like that. Can't please everybody...


I can see it as a native-born American.

My handwriting has always been pretty sloppy. My 9s come out like your 7s when I don't close the loop properly (I start at the bottom).

People confuse my lowercase r's for n's all the time too for a similar reason. Either I loop a little too much or I drag down the overhang so it basically is an n.


Updated the article. Thanks for the context

When it comes to handwritten numbers, Brits frequently mistake German ones for sevens, and Germans British sevens for ones.

A small typo I noticed - "Case-sensitive: 53^5 = 62,259,690,411,360" should be to the eighth power, not the fifth.

Thanks. Fixed

Suggestion: after "a longer ID with a lower chance of visual ambiguity" show how many characters that will be needed to have the same number of IDs as 53^8 using the 22 encoding.

I.e. for a given number of IDs, how many characters are needed in the 53 versus 22 encoding (people who are not good at math might assume it is more than twice as many).
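For what it's worth, the arithmetic the parent suggests is quick to check (alphabet sizes 53 and 22 taken from the article's figures):

```python
import math

# How many characters of a 22-symbol alphabet give at least as many
# IDs as 8 characters of a 53-symbol alphabet?
target = 53 ** 8  # 62,259,690,411,361 possible IDs
needed = math.ceil(math.log(target, 22))
print(needed)  # → 11, far from "more than twice" the original 8
```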


Actually, 53^8 = 62,259,690,411,361 (not ..360)

The article also mentioned the difficult-to-distinguish aurally "B" (Bravo) and "P" (Papa).

But it did not mention the most similar-sounding pair "F" (Foxtrot) and "S" (Sierra), which are nearly indistinguishable.

While one could use the NATO/aviation standard alphabet (Alpha, Bravo, Charlie, Delta...), unless you have a very specifically constrained customer base, it won't help much. Best to also avoid those combinations.

Definitely better to have a slightly longer ID_String and maximal ability to read and speak/hear the characters. It'll save FAR more time and aggravation.


There are many of these ambiguous pairs: B/P, F/S, D/T, M/N, Q/U, ...

The end-to-end transmission can get really bad when you combine several different filter stages, such as a speaker's mouth being injured or obscured, a narrow channel like telephone or radio, noise, and a listener's ear losing parts of the spectrum.

As the sound transmission gets worse, you can get more rhyming ambiguities. Effectively, the consonants are lost in a bad channel and only the vowels come through. In an American English accent, I think these are the groups corresponding to different vowel sounds: A/H/J/K, B/C/D/E/G/P/T/V/Z, I/Y, O, Q/U, F/L/M/N/S/X, R. "W" stands alone with multiple syllables.

Depending on the kind of transmission problem, these groups can start to split apart into smaller subgroups based on which of their sonic differences make it through to the listener.


> But it did not mention the most similar-sounding pair "F" (Foxtrot) and "S" (Sierra), which are nearly indistinguishable.

My family name begins with a 'F' and, indeed, I can't count the number of times where people write a 'S' instead. I've got invoices with a 'S' instead of a 'F'!


That's interesting. I've never encountered a 1 that looks like 7 in handwriting. Usually it's I and l that mess with 1. In what style of handwriting is 1 similar to 7? I'd imagine the top bar on 7 is a sufficient differentiator.

If you don't have any 7s in the text (and 1s only - or vice versa!), it's hard to say what they are. I did encounter this multiple times.

>I've never encountered a 1 that looks like 7 in handwriting. [...] In what style of handwriting is 1 similar to 7? I'd imagine the top bar on 7 is a sufficient differentiator.

Here's a deep link to someone in Germany writing down what visually looks like "77.5 :7:7" but his narration says it's actually "11.5 :1:1"

https://www.youtube.com/watch?v=TT9je5yo7yM&t=30m44s


This just looks like obviously 11.5 :1:1 to me, the slant would be totally wrong for 7s. I had to check back your comment to be sure you were really talking about these 1s as looking like 7s :)

But this thread reminds me of when I lived in Canada for a while (coming from France) and I did misread numbers very often, which was totally unexpected to me. Yes, 7s and 1s looks very different between Canada (and the US I guess) and France (and probably the rest of Europe).

I haven't had this problem with Belgium though I'm not surprised if the standard here had been chosen to be the same as in France.


They might be obvious ones in the context of this one person. But they are trivially not obvious next to someone who writes one like "|" and then seven is just "|" with any sort of hat. Your slant heuristic immediately fails.

It's "obvious" because 7 is always slanted here. But I know it's not the case in North America and I have a good experience on how numbers can be misinterpreted, as I said.

I was just saying it was obvious to me and it even takes effort to see how they could be misinterpreted. But I know they can be.


in some countries' handwritings the digit one is not a vertical bar but it has a little ascending hook, like a digit seven turned vertical, but with a shorter roof.

so 'muricans mistook my German ones for sevens, all the time, and I had to force myself to write what looks like a pipe symbol vertical bar to me instead of my trusted one.

and to disambiguate, we cross the seven like a lower case eff or tee is crossed.


The handwriting of numbers and letters being confusing between countries is something that's easy to not think about until you've actually faced the issue multiple times.

I'm English, and I can't honestly remember which country it was that I've lived in (I think France...) where there were a couple of numbers that even after living there for a year I still wasn't confident reading when hand-written on things like café menus. And I don't think I would have thought of that being a systemic issue rather than just blaming an individual's handwriting before I lived there, despite having taken over 100 trips to France before moving to live there for a year.


Germans write the number 1 almost like an upside-down capital V. It’s not horizontally symmetrical though, which is why it looks like a 7.

A "1" can have a little squiggly roof on it. A big 1-squiggle easily looks like a 7.

Fascinating!

I was born in Europe so I put a horizontal line midway through 7. But now I'm in Canada and nobody else does. It can be a really tiny angular difference between a 1 and a 7 for a lot of people! :)


Same experience, I wilfully switched my handwriting to American 1 (one) as a single vertical line with the European 7 (seven) having an horizontal line midway for disambiguation in a multicultural work environment.

Crossed 7's are fairly common among science majors in American universities. I also cross z's. Again, also fairly common among science majors. (Mine was chemistry.)

7, 1, I, i and l are troublesome because sans-serif vs. serif fonts and other stylistic choices can make them look like each other.

They're missing in the first part, but in the section "Visually ambiguous dictionary" neither 1 nor 7 is present.

If you use both upper and lower case, you are likely to eventually be surprised by some third party system or protocol that is case insensitive. I even found a commercial system which allowed users to choose IDs with case sensitivity (iD and id being distinct) but if you query it for one which does not exist they do case insensitive matching and return the wrong data.

When I reported this bug they said it was for convenience!


Thanks for the anecdote. I've included it in the article.

What a nuts system

I thought this was good neat UX: on the Nintendo Switch I was entering a serial number for some DLC, and the on-screen keyboard had all the ambiguous character keys disabled, which means that the serial numbers are generated without any ambiguous characters.

I'm not sure if this UX was built into the OS, or just part of the game I was playing (Mario + Rabbids Sparks of Hope).


KeepassXC (open source password manager application) uses colour to make passwords more readable. They use one color for each "class" of character: uppercase, lowercase, numbers, symbols, ...

This is an extremely simple idea, but especially with random passwords this helps a lot, even if the font is already hyperlegible.


Bitwarden also uses an unambiguous font with 3 colors (default for letters, blue for numbers, red for symbols); I love it. It baffles me when any password-focused software allows itself to render characters in an ambiguous font without any color differentiation.

You can also add a list of exclusions easily in the KeepassXC password generator. I do, because when you type in a long password on a TV remote, or similar interface, and then realise the l1|I were confused, it's soooo0 infuriating.

As a colorblind person I hate this idea.

I advocate for accessibility and inclusivity constantly, but not implementing additional measures which are helpful to most due to some not being able to make use of that one aid is not the way to go. Direct your hate elsewhere.

Yeah, why? Because the additional information layer benefits some people? Depending on your type of color-blindness and the choice of colors, this might even be an improvement that works for color-blind people.

We are not talking about encoding information only in color (= bad idea); we are talking about encoding information that is already present additionally in the color. And if your app has accessibility settings (it should), this would be a thing that you could switch on and off.


It's an additional layer on top of other ones like using a non-ambiguous font, large size display, alternating background shades, character index numbers under each character, etc.

So cool to read an article discussing a problem I run into on a regular basis.

Whenever I'm creating a 2FA backup on a piece of paper, anxiety hits me every time I cross over certain characters, o/0, v/u, 5/S, etc. I've come to add some fanciness to how I write these characters for this exact reason.

On "Phonetic similarity", reminds me of how I chose my wifi password. I wanted a common word with multiple consonants that a 3rd grader could spell, so I could share the password with a single phrase and have it be unambiguous. Ended up choosing "vacation".


> Whenever I'm creating a 2FA backup on a piece of paper, anxiety hits me every time I cross over certain characters, o/0, v/u, 5/S, etc. I've come to add some fanciness to how I write these characters for this exact reason.

My convention is that I put a dot '.' below every digit (this solves the 5/S, 0/O, 8/B etc. issues [the actually problematic ones will depend on your handwriting]).

If I'm really unsure, I add the NATO/aviation alphabet [1]. There's a 'U', I'll write 'Uniform' (in diagonal, starting from the 'U').

It only requires some discipline. I've done this for more than ten years now, and never lost a single 2FA code.

[1] nitpicking about the actual difference between the NATO and aviation codes can safely be sent to /dev/null


I can’t believe people out there write these things down by hand on paper.

It’s mind bottling.


Damn, being psychic must be cool. I think your mind may be boggled though.

I do that out of paranoia/mistrust for my wifi network, printer, printer software, etc.

It's probably fine to just print it out, but for more sensitive items I definitely write it down by hand.


It's not as if the printer keeps a hidden cache of printed pages. Except maybe it does...even if the feature was created for entirely benign reasons.

It’s not as if photocopiers could randomly replace letters or numbers, right? …right?

Or perhaps they could: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...


That's one of the instances of "built with good intentions" I had in mind.

I can't tell if this is sarcasm. Handwriting is deprecated now?

2fa backup codes? Yeah, I’d be surprised at people writing those out by hand. They’re long and gibberish, odds of an unnoticed error are high. I’d also be surprised at people typing them by hand (as a way to record them, not to input them) for similar reasons.

Well be surprised. I write them down, by hand.

> They’re long and gibberish, odds of an unnoticed error are high.

That's why you "whitelist" those you wrote down and re-used with success: a little checkbox, which when checked means "Successfully re-initialized an authenticator with this 2FA?", works wonders.

A "dot" underneath a character means it's a number (so I'm sure not to mistake '5' with 'S', for example).

My "paper 2FAs" then go to the bank, in a safe.

I've never ever lost a 2FA access code.


> That's why you "whitelist" those you wrote down and re-used with success: a little checkbox, which when checked means "Successfully re-initialized an authenticator with this 2FA?", works wonders.

I just bake the whitelisting into every 2FA code I handwrite. Instead of scanning the QR into the phone and then writing down the backup, I just start by writing down the backup, and then input it manually from the note into my phone. Once successfully used, I know the handwritten 2FA code is valid.

> A "dot" underneath a character means it's a number (so I'm sure not to mistake '5' with 'S', for example).

That one's good, I'll start doing that from now on! I also found writing letters partially in cursive to help too.

> My "paper 2FAs" then go to the bank, in a safe.

Yep same, I got a bank SD box back in 2017 during my first crypto wave. Have found the $100/yr to be incredibly useful. More recently I've created a sort of "defense in depth" for my passwords/codes. Least important things are available a button click away on Bitwarden Chrome extension, more important things are non-cloud-synced google-authenticator on my phone with 2FA backup in bank SD box. Most important things (i.e. crypto private keys) are sharded into pieces and distributed amongst multiple SD boxes.


I love conversations like this. These are arguably not the most cutting edge or exciting topics but hold a lot of significance and power to make life easier for humans (and machines too).

Some of these are areas of best practices that, when done really well -- people may not even notice it. That's an unfortunate fact of life that comes up often -- where the attention to detail and sincerity that people bring to the table often gets lumped under "obviously it should be that way, nothing special to see or applaud here".


As long as we are pointing out mistakes in the article:

9qg6G8B2Z5SIl170O (ariel)

The name of the font is Arial, not Ariel. (No mermaids here, move along)


Yup... also, a screenshot (or using webfonts) would have probably worked better there. On Linux, most of the lines look the same...

Heads up that the article is open source in case you wanted to contribute an edit.

https://github.com/gajus/gajus-com/blob/main/src/blogPosts/2...

I fixed the typo though. Thanks!


Other prior art is the use of a modified base 58 encoding in Bitcoin addresses.

https://en.bitcoin.it/wiki/Base58Check_encoding


> not only to avoid visually ambiguous characters, but also to avoid spelling words in common languages.

Or you should do the opposite: use real dates/words in the ID and your visual confusion almost disappears (though there is a bunch of ambiguity here as well, in similar pronunciation, so it's also not perfect). Humans aren't robots, so they shouldn't be forced to read a meaningless list of random letters.

(example of geospatial system of coordinates based on that is what3words)


Imagine having a coordinate system be owned by a private company.

A few years ago I had to call an ambulance for someone (in the UK) and began giving coordinates, only to hear 'oh do you have what3words it's easier that way' which I found very surprising! I don't love the idea of a proprietary coordinate system either, companies come and go but normal coordinates are universally understood.

You're free to create your own, or not use theirs.

Or we could agree that that's ridiculous and not allow companies to own such things.

Free speech is a right. Interoperability should be a right. Any infringement of those rights had better have a damned good reason. "It's profitable" isn't a good reason.


Don't they have a patent on it?

Yeah.

https://patents.google.com/patent/US9883333B2/en

I think it’s also a good example of increasing computer dependency by ‘human centric’ design: I can quickly and manually sort through a bunch of packages with coordinates or pluscodes written on them with some sense of locality. What3Words is designed to give a sense of familiarity but require an API lookup for every single address.

Letters and numbers also translate directly in most languages, words don’t (take bow as an example. Is it when someone leans over, an archer’s weapon of choice, or a cutesy headpiece?), so the familiarity aspect is limited to people with a good grasp of English.

Its main feature is that it can be commercialized, unlike regular coordinate systems.


> take bow as an example. Is it when someone leans over, an archer’s weapon of choice, or a cutesy headpiece?

Front of a ship, duh.


This post has some overlap with work I did a while back on a "coupon code" system that is optimised for users taking a code printed on paper and entering it into a web form. A number of measures were employed to avoid/correct transcription errors.

Example, docs and links here: https://www.mclean.net.nz/cpan/couponcode/


I wish my parents had access to this when they chose to call me Iain Dooley.

The world has almost unanimously decided my name is now Lain.


For years I thought that Doug McIlroy had a very odd name, until I watched some presentation on YouTube and first heard his name being pronounced – "ah, so that's an i and not a double L!"

Lol I recognise the name from the famous pearls book but always thought his name was your incorrect version.

It probably doesn't help much that both Lain and Lan are fairly famous fictional characters now (Serial Experiments Lain and al'Lan Mandragoran from Wheel of Time).

I think that Iain is the Scottish version of Ian? Is it unacceptable to choose the alternate spelling, Ian?

In an ironic twist, I then get called Lan.

On the plus side, this might help you with networking.

I’d consider it grossly unacceptable to change the first thing gifted to you by your parents.

Your name is your own first and foremost. You can honor your parents in other ways.

Funny story, I was named "Steven" and yet I've been called Steve my whole life, at my preference.

Recently I went through the process of changing my name legally, because I'd fallen into a bad habit of writing "Steve" when asked for my name on some documents, but then remembering my "official" name was "Steven" on others.

Having multiple IDs with different names, especially after moving to a new country, was just too much of a pain - for example my official residence permit name didn't match my passport name, which caused some fun at airports.


The first thing gifted was life, and though that was not bestowed with consent, it's one thing I'd argue for retaining as long as possible. Everything else is fair game to discard in service of making that life a good one.

Eh, there's nothing magical about parental preferences. A loving parent would not want their child to live with a name that they didn't like.

Fortunately with names, there are no returns, but exchanges are accepted (with a low restocking fee) in perpetuity.


I'm an American living in Germany. When I first arrived, the way Germans write the digit 1 surprised me. They write it with the upper hook thing very long, almost like a capital lambda (Λ), which sometimes makes 1 and A visually ambiguous. This isn't really a problem, just something funny about moving to a new country.

I use 1 with a long hook except when I write binary numbers where I use just a | for 1.

I have some other context dependent characters/letters.

I write small z like that in normal writing, but as a mathematical variable I write it as ƶ. (To disambiguate from 2.)

I write small t like † in normal writing, but as a mathematical variable I write it as t. (To disambiguate from + (plus).)

I write q like that in normal writing, but as a mathematical variable I write it with a stroke, which does not display on the iPhone, a ꝗ, a bit similar to a ɋ. (To disambiguate from a (ɑ).)

It’s all about disambiguation, and sometimes having different letter shapes for isolated characters.


my us colleagues regularly mistook the ones for sevens. that's btw why we cross the sevens, like tees and effs

This seems slightly flawed in that it completely removes all members of a similar set rather than normalizing to a single element per similar set.

Thus after normalization, '1lI' would become '111'. This allows you to add seven characters back to the author's code generation alphabet without re-introducing any ambiguity.


If you need more possible values, I agree.

However, if you don't need them, I would remove them so that the user doesn't have to spend any time wondering which character it is. Even though you're processing them all after they type them and fixing them, the user has spent time and effort that they didn't need to, just picking which one it is.

IIRC, I chose to keep them when I did something like this, but I don't think I thought to accept the others and convert them automatically. That project is sunset now, so it's not an issue.


Why not include '1', but make it so '|Il1' all map to the same internal value? That way you have no ambiguity while minimizing alphabet reduction.

I'm not following your suggestion. It seems like we're saying the same thing?

It only reduces the ambiguity if everyone does the same and everyone knows that you've done it.

If you control the system for generating the codes and the system for verifying the codes (which is generally the case for these kinds of codes), then nobody needs to know you've done anything. It's the same normalizing to upper/lowercase characters when you parse a non-case sensitive code.
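A minimal sketch of this normalize-on-parse idea (the canonical choices below are illustrative, not taken from the article):

```python
# Each confusable set maps to one canonical member, so users may type
# any member of the set and the code still parses to the same value.
CANONICAL = str.maketrans({
    "l": "1", "I": "1", "|": "1",  # 1/l/I/| all read as "1"
    "O": "0", "o": "0",            # O/o both read as "0"
})

def normalize(code: str) -> str:
    return code.translate(CANONICAL)

print(normalize("Il|0O"))  # → 11100
```

Generate codes using only the canonical characters, and run every user-supplied code through `normalize` before lookup; the user never needs to know the folding exists.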

Years ago I worked support at an ISP whose usernames were 12-digit numbers. Most regular users and 1st-level support do not know the NATO phonetic alphabet. An easy trick is then to read back the number for confirmation, but using a different grouping of digits. Most users read 1 digit at a time, so I would read back 2: one-two becomes twelve. If they used 2 digits, I would for ease use 3 rather than 1. This is a very easy way to do a fake "checksum" with regular people.

Tangent: All the numbers started with 12, which in effect made them 10 digits. They worked together with a banking system, and the bank folks thought 10 digits was not secure enough, so they complied and added 12 in front of everything.


> Tangent: All the numbers started with 12, which in effect made them 10 digits. They worked together with a banking system, and the bank folks thought 10 digits was not secure enough, so they complied and added 12 in front of everything.

Delicious malicious compliance - I like it.


I have realized that there is a big design space here, as I recently did a write-up of my take, Id30. 30 bits of information encoded base 32 into six chars, eg bpv3uq, zvaec2 or rfmbyz, with some handling of ambiguous chars on decoding.

https://magnushoff.com/blog/id30/
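The core idea is that 30 bits divide evenly into six 5-bit characters. A sketch, using a hypothetical 32-character alphabet (not necessarily Id30's actual one, and without its ambiguity handling):

```python
import secrets

ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"  # illustrative 32-char set

def encode30(n: int) -> str:
    """Encode a 30-bit integer as exactly six base-32 characters."""
    assert 0 <= n < 2 ** 30
    return "".join(ALPHABET[(n >> shift) & 31] for shift in range(25, -1, -5))

def new_id30() -> str:
    return encode30(secrets.randbits(30))
```

Decoding is the reverse table lookup, which is also where aliases for ambiguous characters would be folded in.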


Related reading, from the font designer's side: “Oh, oh, zero!” by Charles Bigelow (of Bigelow and Holmes, makers of typefaces like Lucida and Wingdings), published in TUGboat the journal of the TeX users group: https://tug.org/TUGboat/tb34-2/tb107bigelow-zero.pdf

(There's also a “footnote” by Donald Knuth: https://www.tug.org/TUGboat/tb35-3/tb111knut-zero.pdf, and follow-up by Bigelow: https://tug.org/TUGboat/tb36-3/tb114bigelow.pdf)


> Related reading, from the font designer's side: “Oh, oh, zero!” by Charles Bigelow

I don't know. People tend to use the letter 'O' a lot. And people tend to use zero '0' a lot too.

Who gives a fuck about "Oh"? I mean, seriously, what percentage of articles, blogs, PDFs, webpages, products etc. throughout the world have 'O' and '0' that can be mistaken for one another? And what percentage have "Oh"?

When was the last time a user had to read a product ID over the phone and misread big O / "Oh" for 0?

I don't even think there was a last time, because nobody is using "Oh" in identifiers.

While, on the other hand, it's perfectly fine to use a slashed-zero for zero, to be sure nobody mistakes it for the letter 'O'.

So basically: your link and TFA aren't that related.


I'm not sure I understand your comment, because at first glance it seems to be making a distinction between "Oh" and "O", when Bigelow's article is using "Oh" as the name/vocalization of the letter 'O' (as should be clear from the very first sentence, even if not the title).

So, assuming (still not clear from your comment) that you do understand "oh" to mean the letter 'O', as intended, still your comment is surprising, because some of your own other comments talk about O/0, and the submitted post here too starts with that very example:

> What are visually ambiguous characters?

> O / 0 - The letter O and the number 0 can look very similar

So surely the article is relevant to (at least the first example of) the post? I admit it goes much deeper into just this one example, and only a bit into other examples like 1/l/I and 2/Z or 5/S, but still it's relevant and of value as a representative example I think.


An alternative would be to print IDs using https://en.wikipedia.org/wiki/FE-Schrift, which was specifically designed to make normally similar characters look different.

Good luck distinguishing 0 and O with that font in a random sequence of characters.

The type face you linked is not optimized for humans.

> Its monospaced letters and numbers are slightly disproportionate to prevent easy modification and to improve machine readability.

It's a slightly different issue than what was described in the article (e.g it can't address the cases where IDs are written down).


AFAIK it's designed for automatic number plate reading in Germany.

> In some cases, you might also want to avoid characters that sound similar when spoken. For example, b and p can sound similar when spoken out loud. This can be especially important in situations where IDs are communicated verbally.

In many cases these kinds of IDs are just an encoding of a ground-truth that is a big integer or a sequence of bytes, and that mean we don't have to use ASCII-character granularity, we can also use words.

True, that creates a certain cultural bias for wherever you get the words from, but it opens up new possibilities for error correction and detection, both by the computer and also by the humans transcribing things.


Somewhat related, I always liked the concept of https://what3words.com/

what3words has a proprietary implementation and has sent fairly silly legal threats: https://news.ycombinator.com/item?id=27020810

I'll happily boycott that for-profit company which is masquerading as a public utility, but charging money and going after anyone who reverse engineers what words are what locations.

See also the comments in https://news.ycombinator.com/item?id=27058271

This is exactly the sort of thing that shouldn't be a private company, just like Lat/Lon coordinates and street addresses are effectively public domain, any suitable replacement for lat/lon should also be public domain.


Yikes. Well, less of fan now!

They have some pretty bad flaws in their design relating to this topic:

https://twitter.com/jonty/status/1570062564523917312

> the actual address should be "keen.lifted.fired" instead of "keen.listed.fired" and someone clearly misheard over the phone


Yeah, ideally the dictionary first would undergo rather rigorous pruning based on things like phonetic similarity or how easily a typo might move between two valid words.

That scoring/clustering process makes for interesting problems in their own right, especially if one throws accents into the mix.
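As a toy example of such pruning, even classic Soundex already collapses some of these pairs (a real system would use better phonetic metrics plus accent data, but the clustering shape is the same):

```python
# Classic (simplified) Soundex: first letter kept, consonants mapped to
# digit classes, vowels dropped, adjacent duplicate codes collapsed.
def soundex(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h/w don't break a run of equal codes
            prev = code
    return (out + "000")[:4]
```

Words landing in the same bucket (e.g. "deary" and "dairy" both key to D600) would be candidates for removal from the dictionary.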


The problem with words is that their encoding density is much lower, so it requires more space to store. Suppose you create an alphabet A that consists of the N most common English words. Then, what might be Q characters in base 58 would instead require Q*ln(58)/ln(N)*((avg word length in A)+1)-1 characters. For N=1000 and assuming that the average word length is 5, this gives a factor of ~3.5x increase in storage space required (e.g. a 20 character base-58 ID would map to a ~70 character string of words).
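The arithmetic above can be checked directly; a small sketch:

```python
import math

def words_needed(q_chars: int, src_base: int, n_words: int) -> int:
    """How many dictionary words carry as much information as
    q_chars characters of a base-`src_base` ID."""
    bits = q_chars * math.log2(src_base)
    return math.ceil(bits / math.log2(n_words))

# 20 chars of base-58 need 12 words from a 1000-word list, i.e.
# roughly 70 characters at ~5 letters per word plus separators.
words_needed(20, 58, 1000)  # -> 12
```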

That is true. But is it really a storage problem? Could you not store in whatever base-N arithmetic that has high encoding density, and "just" use the words for display/printing and such? Probably it is more a problem of restricting the range of representable numbers because users are unable to handle pages over pages of random words...

Who cares about that much space?

If you do, you're not storing your bits as text to begin with.


You then have to curate a list of words which don't have similar sounds, aren't composed of subwords, aren't offensive, and avoid other gotchas.

I don't think words work well for codes that aren't meant to be memorized. They make it harder to curate an unambiguous list, since that list needs to be several orders of magnitude larger and the ambiguity can be accent-dependent. Of course, if memorization may be needed, then that effort may be worthwhile.

Error detection with codes isn't hard, that's why checksums exist.
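For instance, the Luhn check digit used on payment cards catches all single-digit errors and most adjacent transpositions:

```python
# Luhn checksum: double every second digit from the right, subtract 9
# from doubled values above 9, and require the total to be 0 mod 10.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2])
    total += sum(d * 2 - 9 if d * 2 > 9 else d * 2 for d in digits[1::2])
    return total % 10 == 0
```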


There are several wordlists which have been curated this way. -- https://en.wikipedia.org/wiki/PGP_word_list

Thanks, that's a neat resource for making hexadecimal numbers more memorizable and easier to transmit phonetically, with some built-in error checking from the odd/even list alternation.

However, for the core purpose of phonetic transmission, it seems needlessly verbose and cumbersome. The short wordlist, combined with some fairly long component words, makes the phonetic representation unnecessarily long. Additionally, I'm not super into some of the fairly obscure names and words included on that list. If I don't need memorability and hexadecimal atomicity, it doesn't seem worth using.


>we can also use words.

And we do, Bravo for B, Papa for P: https://en.wikipedia.org/wiki/NATO_phonetic_alphabet

Always use phonetic code if you're transcribing letters to someone, especially over phone/radio. It saves a lot of hassle on both sides.

If you don't remember the code, no big deal: For everyday situations, use any easily understood word. Like Apple for A.
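A sketch of reading a code out with the NATO alphabet (digits and anything unmapped pass through unchanged):

```python
# Map letters to NATO phonetic words for reading a code aloud.
NATO = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                ("Alfa Bravo Charlie Delta Echo Foxtrot Golf Hotel India "
                 "Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo "
                 "Sierra Tango Uniform Victor Whiskey X-ray Yankee "
                 "Zulu").split()))

def spell(code: str) -> str:
    return " ".join(NATO.get(ch.upper(), ch) for ch in code)
```

So "B7P" reads as "Bravo 7 Papa", removing the b/p ambiguity entirely.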


On linux you can use Theodore Ts'o pwgen tool with the -B arg.

-B, --ambiguous Don't use characters that could be confused by the user when printed, such as 'l' and '1', or '0' or 'O'. This reduces the number of possible passwords significantly, and as such reduces the quality of the passwords. It may be useful for users who have bad vision, but in general use of this option is not recommended.

    $ pwgen -B 32
    oos9upoVieghuew7aeb3iev3jiequeiw acohthahpie7ae4aeboshahWiengieth
    yahW3qua3atheeP9jo4aiY3zeepoosh3 Noh4ooth4ohzeec4zug3ephoo7meich7
    oozae9Eireix4Chaiboz9dofie4Xunof Mohj3uupee9ahngahh9on9sujee9ehae
    weimah9aiXeis3owaexei4uh3ibeecai PaeV7eeChaezahruNgeequoh7zok7thi
    eeJieyah4exiephaiPootei4dokoojoh fohhah3Eec3bah7aeR9iedah7Ve3ea7o
    vahs4eich4pheisoug9aiR3ohChoh7Ch eth9KaeLahdie7ahy9ohCiebohphuse9
    ieye3udumaengai9ies7kae4geeque9T iesoh9eosohthoongaeroo4ehiishohY
    mee4ohjei4ohmika3taijei3Yaixosei ohWoo4eapid7miebee9pooKai3oofeis
    Eechook9quohp7se7ees9thaefahb9an aht3quooV4eiph9ap7aiw4wee7oi7eij
    ishep3weeh7Eero9ohdohth9MietooJ4 Kai9aich9Jee9Angeihee9eehei9esie
    toonaix4xe3Moob3zaic3Eesahs9ahy3 gaey9doozee7sei9quuPae3vohph4Huo
    ouYaephahcog3peiw7iecoo7eetheeph eeNgiezae7oongi7uena7eenaezuT7co
    tai9vuace9eV7Paih7ieN3Ahghiegh3v VaeteeMoobeixai9ingeyahYuzaipaht
    eeng7vei7pho4Ahpoa4kahgheethahz7 phas4theiThu4uqu7iCh3Aepha3shae3
    ieRep3kaideeHeekiNgequieng9raeYo eegahsh9aizooshee9too9oojiox4Lei
    ovohcaePahM9thaebajuChoo3pipheej oowaimeiWahf4Neighoo3Eeyah3uvi4v
    vi4choiThei3eisohw4iP9huehohs4oe ukuchiethaquax3hieChouMahpooy4ee
    aegheeyeemeNeevehud9ohng3dai4jai eth3iedah9Tee3wohneisoo4aicuToos
    iecap7EeJ7raixiuseesiNou9ooT9fie ied3ooveingu7fu7dahdaaYe9tai7ien
    eijee7iKighaingaiChei7giemu4chi3 Thie3faih3ahshooRunohwoaghoh4Aev


Also avoid lowercase rn which can be mistaken for m.

And avoiding vowels can help avoid offensive words within a generated code:

FUKFUK9 - https://www.replacements.com/china-fukagawa-fuk9/c/27446

KUNT1 - https://id.made-in-china.com/co_gzberlin/product_Power-Steer...

base32 removes I, O, U, but other words with A, E need to be avoided too - having no vowels at all helps avoid words in English.
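A minimal generator along those lines; the exact character set here is an illustrative choice (no vowels, and no 0/1/O/I/L/S/B/Q either), not any particular standard:

```python
import secrets

# Illustrative set: consonants and digits, minus vowels and the usual
# visually ambiguous suspects (0/1/O/I/L/S/B/Q).
SAFE = "CDFGHJKMNPRTVWXYZ23456789"

def gen_code(length: int = 7) -> str:
    # secrets.choice gives cryptographically sound randomness per character
    return "".join(secrets.choice(SAFE) for _ in range(length))
```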



Yes, dassic!

Showing cl and d can be hard to discern clifference.

https://www.reddit.com/r/keming/comments/1b2zat4


I'm a fan of z-base-32 for this.

https://philzimmermann.com/docs/human-oriented-base-32-encod...

Command line tool at https://github.com/tv42/zbase32

    $ echo hello, world | zbase32-encode
    pb1sa5dxfoo8q551pt1yw

    $ entropy 16 | zbase32-encode
    y64s31aq6cgjoko9fwbuasf4ce
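A minimal encoder reproducing the first example above, assuming the z-base-32 alphabet and MSB-first 5-bit packing described in the spec (decoding is the reverse lookup):

```python
# z-base-32 alphabet, ordered so the most "comfortable" characters
# appear most often; no padding, zero bits fill the final character.
ZB32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def zb32_encode(data: bytes) -> str:
    nbits = len(data) * 8
    nchars = -(-nbits // 5)  # ceil(nbits / 5)
    n = int.from_bytes(data, "big") << (nchars * 5 - nbits)
    return "".join(ZB32[(n >> 5 * (nchars - 1 - i)) & 31]
                   for i in range(nchars))

zb32_encode(b"hello, world\n")  # -> 'pb1sa5dxfoo8q551pt1yw'
```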

Doesn't help when you have to match the person's name and they have these characters in them. My name contains the letter "o" and I once had a lot of trouble getting something done at the bank. Multiple staff had to crowd around the computer to figure it out. Eventually somebody discovered that when I had opened my account, that o had been entered as a 0 for some reason and the font they were using, also for printing, showed them looking almost identical.

Similar story here, but a "Q" instead of an "O". The tail of the Q looked like dust on the screen. Somehow I haven't run into issues...

"visually unambiguous dictionary" to the author. It's well known that some people have a hard time distinguishing p/b/d/q.

The Latin/English alphabet is common but not universal. I believe this challenge is why TOTP codes use Arabic numerals. The user's keyboard can type these reasonably. Spoken is always a challenge. Even an English speaking audience will pronounce "0" as zero, oh, or zed.

0123456789 are best called "European", I think, as Arabic numerals would be: ٠١٢٣٤٥٦٧٨٩


"They are also called [..] European digits"

The reference chases through to https://www.unicode.org/terminology/digits.html

    Term:    ASCII digits 
    Example: 0123456789 U+0030..U+0039 
    Explanation/Description:
        Commonly used with Latin, Greek, Cyrillic and many other scripts, including some non-European scripts. Used in alternation with native digits in scripts that have them. (Some scripts with native digits make only limited use of ASCII digits.) Infrequently used in many of the remaining scripts. 

    Synonyms: Western digits, Latin digits, European digits
Which then links on to: https://www.unicode.org/glossary/#european_digits

> European Digits. Forms of decimal digits first used in Europe and now used worldwide. Historically, these digits were derived from the Arabic digits; they are sometimes called “Arabic numerals,” but this nomenclature leads to confusion with the real Arabic-Indic digits. Also called "Western digits" and "Latin digits." See Terminology for Digits for additional information on terminology related to digits.


That's the last entry in the list, so it's not very supportive of the idea that they are "best" called that.

Well sure, but the previous poster was making a proposal ("I think"), and just doing a link dump implies ignorance, which fairly obviously isn't the case.

I think anyone who has dealt with both Arabic numerals (as used in Europe) and Arabic numerals (as used in parts of the Arabian world) feels the naming is unfortunate. Arguably this is not the best place to bring that up, but I certainly stopped using "Arabic numerals" after working with some i18n code which supported both Arabic and Arabic numerals.


Maybe not the best names, but I've taken to calling them Western Arabic, Arabic Arabic, and Persian or Urdu Arabic. I typically only deal with the Unicode representation, so the differences between Persian and Urdu numerals are invisible to me (but very visible if you display them with the wrong language context for the viewer!)

In handwriting there is a difference between European and American. In Europe we don't really have problem with 1 vs 7 or g vs 9. But our nines and ones do look like gs and sevens to Americans.

I heard an American making a joke that

"I have gg problems but European handwriting ain't 7 of them."


A few years ago, I created a system that generates a serial number from a prefix and a 32-bit unsigned integer and fixes up this kind of input error when passing the serial.

https://github.com/pallas/gubbins


I came up with base24[1] for this. There are some letter that can be ambiguous but I kept them to make it case insensitive.

[1]: https://www.kuon.ch/post/2020-02-27-base24/


See also Douglas Crockford's Base 32: https://www.crockford.com/base32.html

This takes the approach of allowing ambiguous characters by decoding them to the same value, and also considers the problem of accidental obscenities.
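A sketch of that decode-time aliasing, following the mapping Crockford's spec describes (I and L read as 1, O as 0, case-insensitive, hyphens ignored):

```python
# Crockford's Base32 symbol set excludes I, L, O and U; decoding
# accepts the excluded look-alikes as aliases.
SYMBOLS = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
DECODE = {c: v for v, c in enumerate(SYMBOLS)}
DECODE.update({"I": 1, "L": 1, "O": 0})

def crockford_decode(s: str) -> int:
    n = 0
    for ch in s.upper():
        if ch == "-":  # hyphens may be used for readability; skip them
            continue
        n = n * 32 + DECODE[ch]
    return n
```

So "1O", "io" and "L0" all decode to the same value, and the user never needs to know which glyph was "correct".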


Interesting, I made different choices:

    5-bit base-32: oi23456789 abcdefghkl mnpqrstuvw y

o = 0, i = 1; j, x and z removed.

I like that you can fit 6 characters in a 32-bit integer and still have two bits to spare... makes for compact usernames and network bandwidth.


I believe an implementation is here:

https://godocs.io/encoding/base32



UuidExtensions[1], a C# library, has a way of generating / encoding IDs that has several useful properties:

1. IDs can be generated anywhere (client-side, server-side, etc.) and are still unique

2. IDs are ordered by time

3. IDs don't use L and O because those can be confused for other characters

I've found it very handy in my travels.

[1] https://github.com/stevesimmons/uuid7-csharp?tab=readme-ov-f...


Modern bitcoin addresses use a base-32 character set that leaves out some of the most ambiguous pairs and also permutes the character ordering so that the most visually similar remaining characters produce single-bit errors, which are better handled by the address's error-detecting (and potentially correcting) code.

https://github.com/bitcoin/bips/blob/master/bip-0173.mediawi...


Recently I came up with something similar: https://gist.github.com/ceving/cb68c8f2392255c5ed4ea65a6a199...

But I use an alphabet with 32 characters: abcdefghikmnopqrstuvwxyz23456789

I prefer 32 characters, because that makes it possible to pack 5 random bytes into a token with 8 characters.
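That packing works out exactly: 5 bytes = 40 bits = 8 × 5-bit characters. A sketch using the alphabet above:

```python
import secrets

ALPHABET = "abcdefghikmnopqrstuvwxyz23456789"  # the 32-char set from above

def token8() -> str:
    """Pack 5 random bytes (40 bits) into exactly 8 base-32 characters."""
    n = int.from_bytes(secrets.token_bytes(5), "big")
    return "".join(ALPHABET[(n >> shift) & 31] for shift in range(35, -1, -5))
```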


“Oh By”[1], The universal shortener, has had protections for this built in from the very beginning.

Since the whole point is the ability to convey a message in the physical world with chalk or pencil or whatever, we needed to make sure that characters were unambiguous.

So there are no zeros or ‘o’ characters or ones or ‘l’ characters… I think there were one or two other rules that govern this but I can’t think of them right now…

[1] https://0x.co


Honestly, stuff like this is why I stick with (case-insensitive) hexadecimal for user-facing IDs. I find hex to be the sweet spot between "decently sized alphabet to keep ID lengths down" and "easy to read, communicate, and enter manually". It's also fairly resistant to accidentally generating IDs which will offend your users (unless your users are 1337-speaking time-traveling pre-teens from 2002 who are going to snicker at "b00b5"), which is a nice perk.

Also do not use the same character repeated in a "long" sequence. I hate this with IBANs. Too often there's something like '000000' right in the middle of an IBAN and in case copy and paste is not possible I end up counting the number of zeroes at least thrice. Groups of four characters separated by spaces would help in this case but that's another topic.

I did my PhD on (malicious) visual impersonation of domain names using many of the techniques described here. There are many references to other visual doppelganger techniques included in my paper here: https://par.nsf.gov/servlets/purl/10256904

My research focused solely on the .com domain name space, so our character set was limited.


that research paper only considers ascii characters in domain names?

The paper only considers .com domain names, which have limited character set support, discussed in RFC 1034 https://www.ietf.org/rfc/rfc1034.txt

Essentially A-Z, 0-9, and the - character, and domain names can not start with the dash character.


An approach we are trying is speakable IDs. Three characters for the type of thing, then four random words from a list of clean words with 5 characters:

xxx_flown-moons-deary-flake


Several hard-to-mess-up wordlists have been standardized. -- https://en.wikipedia.org/wiki/PGP_word_list

You'll want to be careful to consider homophones while also taking accents into account. E.g. if your dictionary contains "deary", it probably shouldn't also contain "dairy".

Or “dreary”, or “dear”, or “deer”. Unfortunate choice for the example!

Our approach is to only use 5-character words.

Great point!

This introduces a new type of risk - if it can be interpreted as a sentence, "moons" as a verb isn't really a clean word.

> I would be wary of excluding characters just because they look like other characters when combined

I wish the author would have said more about this. Why be wary?


The implied reason is that it shortens the list of available IDs substantially.

That was my first thought, but the section on case sensitivity already discussed the impact of a reduced alphabet and pointed out adding more characters takes care of that quickly. So I assume the reason is something else.

This is why I only ever use xterm with the default bitmap font, it's literally the only one where I'm absolutely sure which character is which.

Telephone equipment avoids the letters i and o in the alphabetical designation sequence for this reason, they look like numerals 1 and 0.

And TFA doesn't even mention Unicode, scripts, ASCII, Latin, nothing. As you can imagine it all gets much worse with Unicode (though through no fault of the Unicode Consortium). See Unicode TR#39 [0].

  [0] https://unicode.org/reports/tr39/

It would be helpful to also add a screenshot for that font overview, because: https://imgur.com/a/h7Ks1Qj

And even on systems which do have these fonts, they may not always be exactly the same.


> However, as the number of members in the set increases, the number of possible IDs increases exponentially. Case-sensitive: 53^8 = 62,259,690,411,361 Case-insensitive: 22^8 = 54,875,873,536

Nitpick, but isn't this polynomial in the size of the set?


aⁿ is (a/b)ⁿ times larger than bⁿ. The multiplicative difference still grows exponentially in n.

a^n is polynomial in a and exponential in n.

This is why longer passwords are more effective than complex passwords: to gain the same security effect as doubling the password length, you would need to square the alphabet.
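That equivalence is just exponent algebra: N^(2L) = (N^2)^L. A quick check:

```python
import math

N, L = 26, 10  # alphabet size, password length

# Doubling the length gives exactly the same keyspace as squaring the alphabet.
assert N ** (2 * L) == (N ** 2) ** L

def bits(alphabet: int, length: int) -> float:
    """Entropy in bits of a uniformly random string."""
    return length * math.log2(alphabet)

bits(26, 20), bits(676, 10)  # both ~94 bits
```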


You have the proper definitions but are missing the context. An exponential with a larger base still differs from an exponential with a smaller base by an exponentially growing factor.

We're comparing the growth rates of two exponentials representing variable-length identifiers. We're not looking at a constant-length identifier (which is what you're doing by only looking at a^n). Notice the context in which exponential is used in the article: we are changing n from 5 to 8.


>Avoiding Confusion With Alphanumeric Characters

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3541865/


I suppose the first line of defense is a QR code URL. I don't think anyone really enjoys typing long codes.

After that there's ECC. A few extra bytes for a reed-solomon code will fix a lot of issues.


Letters l and I are visually indistinguishable when written in Arial.

We could always use 1s and 0s, maybe group them in eights. Tongue in cheek, but I guess that would be a valid (even if extreme) solution.

How come neither v nor u are in the final set?

They’re not even mentioned and don’t look like a thing else, except maybe each other in some typefaces.


vv ~ w

Aha! Of course!

Works on words and special characters, too. I just skimmed the comments and had to scroll back up to verify that I had NOT just read "Anal Of course!"

Haha. You see what you want to see I guess ;)

> When it matters?

This applies to usernames too! It's easy to phish if platforms render capital I and lowercase l the same


If we include handwriting, then lowercase n and u can be hard to distinguish if written in cursive.

A friend told me about how his work had some senior IT mgrs, who'd clearly been playing with their iPhones too long, decide that the firm shouldn't use Ids at all any more, and started pushing this without consulting the business, even though it was totally inappropriate given how widely they were needed... Caused mayhem and needless arguments!

My OCD approves of this idea. Let’s also add, IDs cannot start with 0 or O.

I have one in my passport number. I still don't know which it is, so I alternate. It hasn't been a problem for anyone yet when registering for planes and crossing borders: the picture is clear, it can be both lol

Both are visually ambiguous, so we are good.

just use numbers and crossbar your 7s - problem gone.

if someone's writing is incompetent, tell them. if you can't, then they've ruined it for themselves by being shit at writing the number 7.


my work id has a 0 and a O in it and it drives me crazy. i only remember it due to muscle memory on the keyboard

Another confusing thing is doing this:

    xxxxx-xxxxx-xxxxx-xxxxx
Instead of something like this:

    xxxxx-xx-xxxxx-xxx-xxxxx
Something could also be said about such a scheme lacking an embedded checksum.

Here's an IBAN (bank account number) in the EU (which thankfully are using a checksum as part of the account number):

    LU29 0022 1712 5582 7000
      ^^
      ||
      two checkdigits
Also some companies think they're "smart" because they pick numbers like this:

    LU29 002 0000 0001 8000
Repeating the same digit, usually a zero, a shitload of times ain't smart. It's fucking dumb.
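The IBAN check digits shown above use the ISO 13616 mod-97 scheme; a sketch of validation:

```python
# ISO 13616 mod-97 check: move the first four characters to the end,
# map letters A..Z to 10..35, and the resulting number mod 97 must be 1.
def iban_valid(iban: str) -> bool:
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A->10 .. Z->35
    return int(digits) % 97 == 1
```

Because 10^k is never a multiple of 97, this catches any single-character error and most transpositions.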

I guess I better stop using Bozos_Gismos

Four quick thoughts:

- Haven't we solved this already? Who hasn't tried to read some code and couldn't tell O from 0 or l from 1, etc.?

- Aside from ambiguous characters you have to be aware of spelling and leet spelling. e.g., 53X, S3X, 5EX, etc.

- FFS stop with the 10+ character strings without spaces or hyphens. There's no reason for that.

- Not everyone has perfect vision. Ambiguous characters *and* less-than-perfect vision (often with no spaces / hyphens) is a mortal UX sin.

We've all been on the wrong end of these, and yet they are common enough - in 2024??!!? - that they need to be mentioned here.


cl looks like d in some fonts or with bad kerning

Out of curiosity, anyone knows why would this post be removed from the front page?

I was excited to see that the post was getting engagement. I saw it in 3rd position. Then checked an hour later and it was nowhere to be seen.

I am assuming this is some sort of opportunistic algorithm at play that gives a post a chance but removes it if it is not performing; curious if anyone has more details.


HN submissions tend to be on the front page when they receive a bit of early votes within (roughly) the first hour, but they disappear rather quickly without further votes. Given that this submission was only 3 hours old when you posted this comment, that is to be expected. (For the record it's now in fifth place, suggesting that it eventually received enough votes to stay on the front page.)

Makes sense. I am just curious about the logic behind the algorithm, more than anything else.


