How To Make A Decent Password Strength Meter?

This discussion came up in mozilla.dev.identity – how do you make a decent password strength meter? Now it could be that someone’s already done this (links?), but I’ve never been embarrassed about reinventing the wheel, so here are my thoughts.

IMO, most password strength indicators suck. They give a fixed bonus for adding punctuation or numbers or upper-case letters, and you can’t have a strong password until you have several of those categories. Therefore, to take random examples, a lot of them think “correct horse battery staple” is a worse password than “Tr0ub4dor&3”.

Inspired by xkcd, here’s a straw-man proposal for a Unicode password strength meter which avoids some of the obvious flaws, while still not being overly complex to implement. Note that this is about strength (resistance to brute force attacks), not about memorability or anything else.

  1. Classify every code point by its Unicode script. (The data needed for this is not large, as most scripts are in contiguous ranges.)
  2. For each script used, take the number of commonly-used characters (this would be a predefined lookup table), and add the values together to make an “entropy” value, which is a rough proxy for the size of the character space from which the password’s characters were taken. So e.g. an Arabic numeral is 10, an unaccented Latin letter is 26, an uppercase Latin letter is 26, a Chinese character would be about 20,000.
  3. Multiply the entropy by the password length in characters.
  4. Make sure it’s over a certain threshold, which can vary depending on the application. You might use 300 for web forum membership login, and 1000 for a bank. One could develop recommendations.

e.g.

“Tr0ub4dor&3”:

  26 (lower-case Latin letter)
+ 26 (upper-case Latin letter)
+ 10 (Arabic numeral)
+ 15 (simple punctuation)
= 77

77 * 11 = 847

“correct horse battery staple”:

  26 (lower-case Latin letter)
+ 15 (simple punctuation)
= 41

41 * 28 = 1148
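The four steps can be sketched in a few lines of Python. This is a minimal sketch, assuming a stubbed-out classification table: the four ASCII categories from the worked examples stand in for real Unicode script data, and the 20,000 fallback for unclassified code points is the rough CJK figure from step 2.

```python
import string

# Stand-in for Unicode script classification; sizes are the
# "commonly-used characters" counts from step 2.
CATEGORIES = [
    ("lower Latin", set(string.ascii_lowercase), 26),
    ("upper Latin", set(string.ascii_uppercase), 26),
    ("Arabic numeral", set(string.digits), 10),
    ("simple punctuation", set(string.punctuation + " "), 15),
]

def strength(password):
    used = set()
    for ch in password:
        for name, chars, size in CATEGORIES:
            if ch in chars:
                used.add((name, size))
                break
        else:
            # Unclassified code point: crude stand-in for a large script
            used.add(("other script", 20000))
    space = sum(size for _, size in used)    # step 2: character-space size
    return space * len(password)             # step 3: multiply by length

def strong_enough(password, threshold=300):  # step 4: threshold check
    return strength(password) >= threshold

print(strength("Tr0ub4dor&3"))                   # 847
print(strength("correct horse battery staple"))  # 1148
```

The per-character loop only needs to discover which categories are present; a character repeated a hundred times adds length but not character-space size, which matches the intent of step 2.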

Now, the flaw of this proposal is that the measure assumes all the password’s characters are independently chosen. Perhaps the way to solve that is to add or multiply in a bonus for “script transitions” (letters to punctuation, one alphabet to another alphabet, and so on), because words, the most common case where successive characters are not independent of each other, are most often all in one script.

“Tr0ub4dor&3” has 7 such transitions, “correct horse battery staple” has 6. But “intelligentsia”, while being long, has none.
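Counting those transitions is a short pass over adjacent character pairs. Again a sketch, with the same simplified four-category classifier standing in for real Unicode script lookup:

```python
import string

def category(ch):
    # Simplified stand-in for per-code-point Unicode script classification
    if ch in string.ascii_lowercase: return "lower"
    if ch in string.ascii_uppercase: return "upper"
    if ch in string.digits:          return "digit"
    return "punct"

def transitions(password):
    # Count adjacent pairs whose categories differ
    cats = [category(ch) for ch in password]
    return sum(a != b for a, b in zip(cats, cats[1:]))

print(transitions("Tr0ub4dor&3"))                   # 7
print(transitions("correct horse battery staple"))  # 6
print(transitions("intelligentsia"))                # 0
```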

Thoughts?

27 thoughts on “How To Make A Decent Password Strength Meter?”

  1. Thinking about the “entropy” calculation – so, your goal is to assign a cost to brute-force recovery, which makes a lot of sense. Especially since the most likely threat model, these days, is a rainbow table attack on a recovered (hashed, maybe not salted) password database.

    But assigning a value greater than 256 for exotic code points presumes that the attacker is going to actually enumerate valid Unicode characters. I wonder if a Unicode-savvy attacker would instead just run through all valid UTF-8 byte strings, without reference to which block the resulting code point lands in?

    If that is the case, then what you’re really capturing is the idea that a single Unicode character may have more than one byte of entropy. Perhaps a simpler algorithm would simply count the bytes in the UTF-8 representation?
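That simpler algorithm really is a one-liner; a sketch of the byte-counting idea:

```python
def utf8_bytes(password):
    # Count bytes in the UTF-8 encoding rather than code points
    return len(password.encode("utf-8"))

print(utf8_bytes("staple"))  # 6: ASCII is one byte per character
print(utf8_bytes("stäple"))  # 7: "ä" encodes as two bytes
print(utf8_bytes("密码"))    # 6: each CJK character here is three bytes
```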

    A second point – and this is quite possibly out of scope for the problem you’re trying to solve – is that the most dangerous thing a user can do is reuse a password across sites that are keyed with the same user identifier. So it would be nice if my password strength meter also gave me some scary feedback if the password matched something in my browser’s password DB. Obviously more like a Firefox feature, that.

  2. The down-side to using transitions as a measure of hardening is that it will still score “P@ssw0rd” pretty well: it has upper- and lower-case Latin letters, an Arabic numeral, and simple punctuation (616 points), and makes 4 transitions between them.

    I’d think generating passwords in Firefox itself might be a good way to hit the big 3 tests (that I, a non-expert, am aware of):

    * High entropy
    * No reuse between sites
    * Does not exist in known (or at least common) password dictionaries/rainbow tables

    That last bit might be a killer feature, actually…

  3. Pingback: An entropy based password strength checker in a data URI « Flagfox

  4. I think it’s time to admit that people are using variations of dictionary words in their passwords, since pass phrases are advocated so heavily.

    Pass phrases actually decrease entropy, as you can narrow down the distribution of “letters at the first position of a word”, “letters at the second position of a word”, and so on.

    For instance, there’s no English word starting with “Bbbb”, so a whole lot of combinations can simply be ignored in a brute-force attack, or at least tried much later in the attack.

  5. > the most dangerous thing a user can do is reuse a password across sites…

    While you’re right, just displaying a notice telling the user not to do that isn’t enough. You’ve highlighted the problem, but the suggested solution (remembering a strong, unique password for every site) just can’t be done by the typical human.

    So what are the solutions? There’s a few to choose from:
    1. Reusing a set of passwords across similar sites, grouped in a way that makes it easy for you to remember. It’s not perfect, but if your password gets compromised you’re limiting the potential damage. The most dangerous sites (e.g. online banking) would get their own passwords.
    2. Hashing the password yourself – i.e. adjusting a common master password with some piece of information from the site. Effectiveness is highly dependent on your choice of “hashing algorithm” – simply adding the domain name to the end of your password isn’t going to be tricky to reverse engineer.
    3. Writing all your passwords down in a paper book you keep next to your computer. Great protection against the typical cyber threats (hackers, viruses etc), but you’re effectively telling the passwords to anyone who knows about the book (can you trust your friends and colleagues?)
    4. Keeping all your passwords in an electronic safe. Allows you to have the passwords as complex as the site allows (effectively turning them into shared keys), but it’s a great prize if the safe is cracked, and the software I’ve seen is very dependent on the clipboard, which can be intercepted by other software.
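Option 2 can be mechanized so the “hashing algorithm” is a real key-derivation function rather than something easily reverse engineered. A sketch, not taken from any particular tool; the function name and iteration count are illustrative:

```python
import base64
import hashlib

def site_password(master, domain, length=16):
    # Derive a distinct per-site password from one master secret,
    # using the domain as the PBKDF2 salt.
    key = hashlib.pbkdf2_hmac("sha256", master.encode(), domain.encode(), 100_000)
    return base64.b64encode(key).decode()[:length]

# The same master secret yields unrelated passwords on different sites
print(site_password("correct horse", "bank.example"))
print(site_password("correct horse", "forum.example"))
```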

    Services like OpenID and Mozilla Personas do play a part, but they need to be implemented by the website, so are out of the consumer’s hands.

    > …that are keyed with the same user identifier
    I’d broaden that to sites that share any common information. That might be an email address, a phone number, or being the 5th person tagged in the same photo where the four other people are already known. Basically, there are too many ways to mine the data for different identifiers to be good protection against a determined opponent, although you’ll probably beat the spammers who are just trying to send the same message through as many legitimate accounts as possible.

  6. Anything that doesn’t use a dictionary is going to be a terrible password strength meter. It will rate “password123” (= zero security) as having the same security as “ysshlzag241” (= decent security).

    What I’d suggest doing is looking at real-world black-hat password crackers. A while ago, I had a friendly chat with a hacker who stole my site’s password database and posted some hidden ads for a while to make a quick buck, and whom I was asking to please not post my site’s password database on BitTorrent for the lulz. He told me, quote: “Keep in mind hybrid-dictionary attacks (word-mutations) are extremely effective. I only use brute-force to clear small passwords.” I don’t know what exactly this means, but I’m guessing it means to try variants on dictionary words rather than bothering with actually brute-forcing long passwords (which is pointless). If you require all passwords to be 16 characters, hackers will only target passwords that are 16 characters or longer, and the very first things they’ll try are things like “aaaaaaaaaaaaaaaa” and “passwordpassword”. You won’t have increased security much.

    What you really want here, abstractly, is to manipulate users into adopting a password distribution algorithm wherein no single password has more than negligible probability of being chosen. Identifying the current distribution that users use and banning the most common passwords from it will only give rise to a new distribution that may well have passwords that are every bit as vulnerable. That will only work until hackers catch on to the features of the new distribution. In that sense, merely requiring length is not going to solve any problems, unless it happens to push users to use harder-to-guess patterns (which it might).

    The Coding Horror post is not helpful, because it completely ignores anything other than simple brute force, which no one will try for harder passwords.

    I note this critical sentence in the CyLab paper that bastiaan links to: “It is important to note that 16-character minimum policies are rare in practice. Hence, current guessing algorithms, including the Weir algorithm, are not built specifically with them in mind. Although we do not believe this affects our overall findings, further investigation would be beneficial.” I wonder whether simple length requirements would really hold up well if they became better established. The paper is certainly informative, though, and argues that length should be a big consideration.

    The INRIA paper that Devdatta links to seems mostly concerned with online passwords, and doesn’t seem so relevant to in-browser stuff that doesn’t send the passwords home.

    The Dropbox blog post that jyo links to may or may not be theoretically secure, but it sure looks a lot better than most password strength checkers I’ve seen!

  7. Surely running through all valid UTF-8 byte strings doesn’t get you anything? If you really mean _all_, then the search space is immense. If you are restricting it, then you are thinking about characters once again.

    There are (say) 20,000 Chinese characters in every-day use. If you know a user’s password is 3 randomly-chosen Chinese characters, that’s 20000^3 possibilities, whether you think about them as 3 characters or 9 UTF-8 bytes.
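The search-space arithmetic here, for the record (20,000 is the estimate of everyday Chinese characters used above):

```python
char_space = 20000 ** 3  # three characters from ~20,000 candidates
byte_space = 256 ** 9    # all 9-byte strings, valid UTF-8 or not

print(f"{char_space:.1e}")  # 8.0e+12
print(f"{byte_space:.1e}")  # 4.7e+21
# Enumerating raw bytes instead of characters inflates the search space
# by almost nine orders of magnitude, so a byte-level search gains nothing.
```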

    I’m not sure it’s actually a security maxim to force users to eliminate password duplication. I certainly can’t say I have no password duplication across the hundreds of websites where I have a login.

  8. He writes: “I used oclHashcat-lite with the full range of a common US keyboard – that is, including uppercase, lowercase, numbers, and all possible symbols:”

    His conclusion is only valid when the characters in your password are chosen from a set of about 80 or 90. If your password is in Chinese, then a shorter one may well be equally or more secure. Hence my algorithm’s attempt to take into account the range of characters you may be choosing from.

  9. Having read the Dropbox blog, I think you probably also need a common-password-list-based check as well. That would exclude stuff like P@ssw0rd.
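Such a check is just a set lookup against a list of known-common passwords. The list below is purely illustrative; a real meter would ship the top entries from leaked password corpora:

```python
# Illustrative blacklist; real meters use the most common leaked passwords
COMMON_PASSWORDS = {"password", "p@ssw0rd", "123456", "letmein", "qwerty"}

def is_common(password):
    return password.lower() in COMMON_PASSWORDS

print(is_common("P@ssw0rd"))                      # True
print(is_common("correct horse battery staple"))  # False
```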

  10. Interesting.

    Tr0ub4dor&3: 1.83 years

    “correct horse battery staple”: 1.24 hundred trillion trillion centuries

    I note that replacing an “e” with an “é” doesn’t change its opinion of the size of the search space.

  11. Pass phrases do mean that each letter depends to a certain degree on the one before it, but that disadvantage is more than offset by the increased length.

  12. If you want to make a password strength meter that’s actually halfway decent, it’s going to have to do a significant amount of pattern analysis, looking for common ways in which passwords are constructed.

    Ultimately, here’s the formula you want: for each “element” in the password, calculate the number of possible elements in the space. Multiply these numbers together, and finish up by taking a logarithm of the resulting product. (Protracted arguments could be made about which logarithm to take. I tend to use the natural log, because I have a background in math; but the log base two gives you the number of “bits of entropy”, which is a fairly common measure. Ultimately, it doesn’t matter as long as you’re consistent.)

    I’ll run the two example passwords through the formula manually. I’ll be ignoring the fact that both examples are in fact widely publicized example passwords. A real password strength indicator SHOULD carry a list of common ones and assign them a much worse score (say, the log of the number of common passwords on the list). But in order to show how to calculate the scores of arbitrary passwords…

    Tr0ub4dor&3 consists of three elements: A medium-grade dictionary word (“troubadour”) with one minor misspelling and two common substitutions; a common punctuation mark; and a numeral. The word “troubadour” is sufficiently out of the ordinary to warrant a full-sized spelling dictionary (say, eighty thousand entries). The misspelling is a simple omission, and there were ten letters in the word originally, so there were at least ten possibilities for omitting a letter, but I’d assign the misspelling a value of at least twenty, on the grounds that there are a couple of other common ways of misspelling a word besides omitting a letter. The first substitution then could’ve been performed on any of nine letters, with 2-5 common options for each letter (it varies depending on the letter; let’s say 3 on average), so we’ll call that 27 possibilities. The second substitution only had 8 letters left to work with, so call that 24. So the number of possibilities for the first element, the mangled word “Tr0ub4dor”, is somewhere around 80000 * 20 * 27 * 24, or about 10^9 possibilities. The ampersand isn’t a period or underscore or hyphen or space (the MOST common punctuation marks in passwords), but it’s in the next category up, ones that can be easily typed on a US-English keyboard. Call it thirty possibilities. The one-digit number has ten possibilities. I suppose we should give it credit for capitalizing the T — a rather underwhelming feature, but let’s count it anyway: multiply by two (two possibilities: either the word is capitalized on the first letter, or it’s not). Altogether, 80000 * 20 * 27 * 24 * 30 * 10 * 2 = 622080000000. So the natural log is about 27.2. (If you use the log base 2 instead of base e, you get 39 bits of entropy. If you use the common log, you get 11.8.)

    The second password, “correct horse battery staple”, consists of four common words strung together with spaces. Since the spaces are all the same character and, furthermore, it’s one of the most common non-alphanumeric choices (along with hyphen and underscore and period), we’ll assign the spaces a value of 4, collectively. The real entropy here comes from the words. Those are all fairly common words (much more common than “troubadour”), so they could easily come from a small, gradeschool-type dictionary — say, twenty thousand entries. 20000 * 20000 * 20000 * 20000 * 4 = 640000000000000000. Taking the natural log gives us a score of 41.0. Using the log base two instead, it has 59 bits of entropy. A common log would assign it a value of 17.8.

    So, comparing the two:
    Tr0ub4dor&3: natural score 27.2 (39 bits, common score 11.8).
    correct horse battery staple: natural score 41.0 (59 bits, common score 17.8).

    The tricky parts of this process are A) splitting the thing into elements and B) deciding how many possibilities there are for each element. Of these problems, A) is significantly more difficult if you intend to accurately handle passwords that contain “common substitutions” (e.g., Tr0ub4dor), because your software has to somehow figure out that this is a mangling of the word “troubadour”. This is where the pattern matching comes in. Humans “just see” this stuff intuitively; computers have to work rather harder to figure it out. The main complicating factor for part B) is deciding how large a dictionary to claim. Password crackers are known to work in layers, trying smaller dictionaries first and then moving up later to larger ones (skipping the words already tried, of course; the dictionary files they use are constructed to make this automatic). If you could take a really large dictionary and somehow sort it by word frequency, you could use a word’s position in the frequency-sorted dictionary to construct its (unmangled) score.
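Plugging the element counts estimated above into the formula is the easy half; this sketch leaves out part A) entirely and just multiplies the per-element possibility counts:

```python
import math

def score(element_spaces, base=math.e):
    # Multiply the possibility counts per element, then take a log
    return math.log(math.prod(element_spaces), base)

# "Tr0ub4dor&3": word, misspelling, two substitutions, punctuation,
# digit, capitalization (counts as estimated in the comment above)
troubadour = [80000, 20, 27, 24, 30, 10, 2]
# "correct horse battery staple": four common words plus the spaces
staple = [20000, 20000, 20000, 20000, 4]

print(round(score(troubadour), 1))  # 27.2 (natural score)
print(round(score(staple), 1))      # 41.0
print(round(score(troubadour, 2)))  # 39 bits
print(round(score(staple, 2)))      # 59 bits
```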

  13. I’ve seen more and more of this self-imposed censorship.

    Either you find the word acceptable or you don’t. Censorship is immoral and dishonest, but this type of censorship is also plain stupid.

    Whatever the meaning of those words, it is carried by any substitute. So “shit” is unacceptable but “excrement” is fine, and so is the coward’s way out: **** or !$%&€.

    If people find those words “offensive”, it is because of their meaning, and their meaning can be reproduced using the most harmless words.

    But “offensiveness” is subjective and therefore not a measure of anything. They may find those words “offensive”, but their right not to see or hear them does not override everyone else’s right to Free Speech. What they do is bullying, and you should not follow the vocal, hypocritical fanatics.

    Consult for example the OED (Oxford English Dictionary), where you can see that the name Jesus has been corrupted in more than 100 different ways into what is called “profanity”.

    People didn’t give their lives in the past in defense of freedom so that we would surrender it to a bunch of immoral and dishonest people.

    Just say NO! to censorship. Say NO! to self-censorship!

  14. Hello Aryeh,

    Our paper on measuring password strength can still be used in an off-line setup. Instead of using a per-website n-gram distribution, one would use a generic n-gram distribution. You would only need a few hundred thousand passwords taken *somewhere*. Split those passwords into their constituent n-grams, add noise, and ship the n-gram distribution with the browser. Adding the right amount of noise guarantees that the original passwords’ security is not compromised when you publicly distribute the n-gram database.
    When using 3-grams, the size of the distribution that needs to be stored in the browser is modest, on the order of a few megabytes.

    This is of course assuming you can legally find passwords that can be used for this purpose. Any leaked password database would work wonders. However, not being a lawyer, my guess is that you cannot use any of those leaked databases for this purpose. You would need to find other sources of passwords that can be legally used for this purpose. Again, the security of the original password DB would be maintained by the addition of noise, up to a certain factor discussed in the paper.
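The pipeline might look roughly like this sketch: a toy corpus, add-one smoothing, and no noise added to the counts (a real deployment would add the privacy-preserving noise described above before shipping the distribution):

```python
import math
from collections import Counter

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_distribution(passwords, n=3):
    # Count 3-grams across a training corpus of passwords
    counts = Counter()
    for pw in passwords:
        counts.update(ngrams(pw, n))
    return counts, sum(counts.values())

def surprisal(password, counts, total, n=3):
    # Higher is stronger: sum of -log2 probabilities of each 3-gram,
    # with add-one smoothing for unseen 3-grams
    return sum(-math.log2((counts[g] + 1) / (total + 1))
               for g in ngrams(password, n))

corpus = ["password", "123456", "letmein", "password1"]  # toy corpus
counts, total = build_distribution(corpus)
# A password full of common 3-grams scores lower than random junk
print(surprisal("password", counts, total) < surprisal("zq9!xkvm", counts, total))  # True
```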

  15. Substitution of non-letters for letters is also used as a way of mincing four-letter words. john presumably didn’t follow the discussion about password strength and simply reacted to the superficial encoding of words via that method and assumed it was related to the phenomenon with which he was familiar.
