This discussion came up in mozilla.dev.identity – how do you make a decent password strength meter? Now it could be that someone’s already done this (links?), but I’ve never been embarrassed about reinventing the wheel, so here are my thoughts.
IMO, most password strength indicators suck. They give a fixed bonus for adding punctuation or numbers or upper-case letters, and you can’t have a strong password until you have several of those categories. Therefore, to take random examples, a lot of them think “correct horse battery staple” is a worse password than “Tr0ub4dor&3”.
Inspired by xkcd, here’s a straw-man proposal for a Unicode password strength meter which avoids some of the obvious flaws, while still not being overly complex to implement. Note that this is about strength (resistance to brute-force attacks), not about memorability or anything else.
- Classify every code point by its Unicode script. (The data needed for this is not large, as most scripts are in contiguous ranges.)
- For each script (or character class within a script) used, take the number of commonly-used characters (this would be a predefined lookup table), and add the values together to make an “entropy” value, which is a rough proxy for the size of the character space from which the password’s characters were taken. So e.g. an Arabic numeral is 10, a lower-case unaccented Latin letter is 26, an upper-case Latin letter is another 26, simple punctuation is about 15, and a Chinese character would be about 20,000.
- Multiply the entropy by the password length in characters.
- Make sure it’s over a certain threshold, which can vary depending on the application. You might use 300 for web forum membership login, and 1000 for a bank. One could develop recommendations.
e.g.
“Tr0ub4dor&3”:
26 (lower-case Latin letter)
+ 26 (upper-case Latin letter)
+ 10 (Arabic numeral)
+ 15 (simple punctuation)
= 77
77 * 11 = 847
“correct horse battery staple”:
26 (lower-case Latin letter)
+ 15 (simple punctuation)
= 41
41 * 28 = 1148
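The steps and the two worked examples above can be sketched in a few lines of Python. This is only a rough illustration, not a real implementation: the pool sizes are the ones quoted in this post, the names `POOL_SIZES`, `classify` and `strength` are made up for the example, and the classifier uses hard-coded ranges rather than real Unicode script data (which Python’s standard library doesn’t expose). It also lumps everything it doesn’t recognise, including the space, into “simple punctuation”, which is what the second example above does.

```python
# Pool sizes quoted in the post. The class names are hypothetical labels.
POOL_SIZES = {
    "latin_lower": 26,
    "latin_upper": 26,
    "digit": 10,
    "punct": 15,    # "simple punctuation"; the space is counted here too
    "han": 20000,   # rough figure for commonly-used Chinese characters
}


def classify(ch: str) -> str:
    """Very rough stand-in for a Unicode script lookup (an assumption)."""
    if "a" <= ch <= "z":
        return "latin_lower"
    if "A" <= ch <= "Z":
        return "latin_upper"
    if "0" <= ch <= "9":
        return "digit"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "han"
    return "punct"  # assumption: anything else is treated as punctuation


def strength(password: str) -> int:
    """Sum the pool sizes of the classes used, then multiply by the length."""
    classes_used = {classify(ch) for ch in password}
    pool = sum(POOL_SIZES[c] for c in classes_used)
    return pool * len(password)


if __name__ == "__main__":
    for pw in ("Tr0ub4dor&3", "correct horse battery staple"):
        print(pw, strength(pw))
```

A caller would then compare the result against whatever threshold the application picks, e.g. `strength(pw) >= 300` for the forum case mentioned above.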
Now, the flaw of this proposal is that the measure assumes all the password characters are independently chosen. Perhaps the way to solve that is to add or multiply a bonus for “script transitions” – letters to punctuation, one alphabet to another alphabet, etc. This helps because words, the most common case where successive characters are not independent of each other, are most often all in one script.
“Tr0ub4dor&3” has 7 such transitions (counting the change from upper case to lower case after the “T” as one), “correct horse battery staple” has 6. But “intelligentsia”, while being long, has none.
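Counting transitions is easy to sketch as well. Again, this is an illustration under assumptions rather than a worked-out design: `classify` and `count_transitions` are hypothetical names, the classifier is the same crude range check rather than a real script lookup, and a change of case is treated as a transition because that is what makes the count for “Tr0ub4dor&3” come out at 7. How the count should feed into the score (added or multiplied) is left open, as in the text.

```python
def classify(ch: str) -> str:
    """Crude character-class lookup; a real one would use Unicode scripts."""
    if "a" <= ch <= "z":
        return "latin_lower"
    if "A" <= ch <= "Z":
        return "latin_upper"
    if "0" <= ch <= "9":
        return "digit"
    return "punct"  # assumption: everything else, including the space


def count_transitions(password: str) -> int:
    """Count adjacent character pairs whose classes differ."""
    classes = [classify(ch) for ch in password]
    return sum(1 for a, b in zip(classes, classes[1:]) if a != b)


if __name__ == "__main__":
    for pw in ("Tr0ub4dor&3", "correct horse battery staple", "intelligentsia"):
        print(pw, count_transitions(pw))
```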
Thoughts?