Qualys Blog

www.qualys.com

Discovered Patterns in Numeric Passwords Raise New Questions

In my previous article, I showed how a powerful tool like John the Ripper can crack a few million passwords, mainly using a dictionary attack strategy.

As pointed out by Jeremi Gosney on the Security Nirvana blog, it is impossible to write a “Top Passwords” post for the LinkedIn breach because the leaked list only contained unique hashes. That being said, through the analysis of password patterns and the discovery of a few common tendencies, we can see how bad humans are at random password generation by focusing on a group of passwords easily cracked by an incremental attack: numbers.

Numeric Passwords

Six Digit Password Heat Map

Numbers are easy to crack, and out of the 6.5 million passwords exposed in the recent LinkedIn breach, I was able to crack just over 200,000 numeric passwords (i.e., passwords consisting only of the digits 0 through 9). Of those 200,000 numeric passwords, 93,340 contained 6 digits and 55,027 contained 8 digits, together roughly 75% of all numeric passwords found.

Choosing a purely numeric password is usually a horrible idea, because even for 8-digit numeric passwords, it only takes a few seconds to generate the SHA-1 hashes for the 100 million (10^8) possible combinations. By comparison, an 8-character password drawn from lower-case and upper-case letters, digits, and 10 special characters (72 symbols in total) yields 72^8, or about 7.2 * 10^14, possible combinations (i.e., it takes 7.2 million times longer to generate hashes of all possible combinations).
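The arithmetic behind that comparison can be checked in a few lines. This is only a sketch, and it assumes that the mixed-character figure refers to 8-character passwords over a 72-symbol alphabet (26 + 26 + 10 + 10), which is the reading that reproduces the 7.2 * 10^14 figure.

```python
# Search-space sizes for 8-character passwords (assumed length for both cases).
LENGTH = 8
DIGITS = 10                      # 0-9
MIXED = 26 + 26 + 10 + 10        # lower + upper + digits + 10 special characters

numeric_space = DIGITS ** LENGTH
mixed_space = MIXED ** LENGTH
ratio = mixed_space / numeric_space

print(f"numeric: {numeric_space:,}")
print(f"mixed:   {mixed_space:.1e}")
print(f"ratio:   about {ratio / 1e6:.1f} million times larger")
```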

One of the wonderful properties of numbers is how easily they can be represented graphically. As we can’t know which passwords are duplicates, we focused on the distribution of the numerical passwords and chose a method to enhance the patterns.

In the heat map shown above, the 93,340 6-digit passwords that I cracked are represented. The x-axis represents the first two digits from 00 to 99, and the y-axis represents the last four digits from 0000 to 9999. In other words, each column stands for a single prefix (00 to 99, from left to right) and each row represents a “window” of 100 suffixes (from 0000-0099 at the bottom to 9900-9999 at the top). Thus, each pixel of the graphic represents 100 possible passwords.

For example, the lower left square represents 000000 through 000099; the square to the right of it represents 010000 through 010099; and the top left square represents 009900 through 009999. The color of each square represents the number of cracked passwords within that range, with blue representing very few cracked passwords and red representing many, as shown in the legend below. Note that the color distribution in the heatmap is not uniform: color values are shifted based on other values in the same row and column to make the patterns more obvious. Heat maps were generated using R.
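The mapping from a password to a square can be sketched as follows. The original heat maps were produced in R; this Python version is only an illustration, and the names cell_for and bin_passwords are hypothetical.

```python
from collections import Counter

def cell_for(password):
    """Return (column, row) of a 6-digit password in the 100 x 100 grid."""
    if len(password) != 6 or not (password.isascii() and password.isdigit()):
        raise ValueError("expected exactly six digits")
    column = int(password[:2])        # first two digits, 00-99
    row = int(password[2:]) // 100    # last four digits, in windows of 100
    return column, row

def bin_passwords(passwords):
    """Count cracked passwords per square; row 0 is the bottom of the map."""
    return Counter(cell_for(p) for p in passwords)

# The three squares named in the text:
print(cell_for("000000"), cell_for("010099"), cell_for("009999"))
```

Each count in the resulting Counter would then be turned into a color, with the row-and-column shifting described above applied as a separate step.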

Heat Map Legend

Patterns Extracted from Numeric Passwords

If users selected their passwords randomly, then the passwords would be distributed evenly with each pixel representing about 9 passwords (93,340 divided by 10,000 squares for our 100 x 100 heatmap). In this case the heatmap would be a uniform color with some “noise.” However, users don’t select passwords randomly, so we can discern areas with higher and lower concentrations of passwords.

First, we spot some boxes in the bottom left corner and two lines, one vertical and one horizontal, which I will discuss later. We also clearly see a diagonal running from bottom left to top right, which represents the passwords composed of three repetitions of the same two digits: 313131, 424242, etc. If we look into the raw data, we can see that all 100 of them are present (and we can guess that some of them are probably used by more than one LinkedIn user). At this point, any attacker can guess that this pattern is used for letters as well (for example, by analyzing the cracked non-numeric passwords, I found 474 of the 676 (26^2) possible lowercase-letter passwords of this form, such as ababab).
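A repetition pattern like this is cheap both to detect and to enumerate. The sketch below is a minimal illustration; the helper names is_repetition and repeated_candidates are hypothetical and not part of any cracking tool.

```python
from itertools import product
import string

def is_repetition(password, unit_len):
    """True if the password is one unit_len-character block repeated end to end."""
    if unit_len <= 0 or len(password) <= unit_len or len(password) % unit_len:
        return False
    unit = password[:unit_len]
    return password == unit * (len(password) // unit_len)

def repeated_candidates(alphabet, unit_len, repeats):
    """Yield every block of unit_len symbols repeated `repeats` times."""
    for unit in product(alphabet, repeat=unit_len):
        yield "".join(unit) * repeats

digit_candidates = list(repeated_candidates(string.digits, 2, 3))
letter_candidates = list(repeated_candidates(string.ascii_lowercase, 2, 3))
print(len(digit_candidates), len(letter_candidates))
```

The two candidate lists hold 100 and 676 entries, matching the counts quoted in the text.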

Let’s take a look at the areas and lines in the bottom left zone of the heatmap, and let’s add some color in order to identify them:

Six Digit Password Heat Map with Highlighted Regions

In the green area, we can see all the dates represented as DDMMYY. If we look closer, we can also see that the months (rows) with fewer than 31 days have darker pixels in the 31st column. This also holds true in the diagonally symmetrical red area, which represents MMDDYY. With those two date formats, we can find nearly 40% of the passwords in their corresponding areas, although these cover only 6% of this heat map. The horizontal pink line is composed of numbers ending with a four-digit year of the 20th century (i.e., 19xx), and the vertical pink line represents the numbers beginning with a year from 1955 up to 1998. The yellow area corresponds to the dates represented as YYMMDD from 1955 to 1999.
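The 6% figure can be reproduced by counting squares. This sketch assumes the green and red regions are the plain bounding rectangles of each format (day 01-31 by month 01-12, and its mirror image), which is an assumption about how the regions were drawn.

```python
def in_ddmmyy_area(column, row):
    # DDMMYY: column = DD (first two digits), row = MM (last four digits // 100)
    return 1 <= column <= 31 and 1 <= row <= 12

def in_mmddyy_area(column, row):
    # MMDDYY: column = MM, row = DD
    return 1 <= column <= 12 and 1 <= row <= 31

covered = sum(
    1
    for column in range(100)
    for row in range(100)
    if in_ddmmyy_area(column, row) or in_mmddyy_area(column, row)
)
share = covered / (100 * 100)
print(covered, f"{share:.1%}")   # 600 squares, 6.0% of the map
```

Each rectangle holds 31 x 12 = 372 squares, and the two overlap on 12 x 12 = 144 squares where the day is at most 12, so together they cover 600 of the 10,000 squares.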

When a password specialist sees these patterns in the cracked list, the obvious next step is to generate a dictionary of dates that includes more complex strings like “November 4, 2011,” which would have been harder to crack in incremental mode. Several tens of thousands of passwords in that leak are indeed dates in various formats and languages.
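Such a date dictionary is straightforward to generate. The sketch below covers only the numeric formats mentioned in this post plus one English long form; the function name date_candidates and the choice of formats are illustrative assumptions, not the dictionary actually used.

```python
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def date_candidates(start, end):
    """Yield several textual forms of every date from start to end inclusive."""
    day = start
    while day <= end:
        dd, mm = f"{day.day:02d}", f"{day.month:02d}"
        yy, yyyy = f"{day.year % 100:02d}", f"{day.year:04d}"
        yield dd + mm + yy        # DDMMYY
        yield mm + dd + yy        # MMDDYY
        yield yy + mm + dd        # YYMMDD
        yield dd + mm + yyyy      # DDMMYYYY
        yield yyyy + mm + dd      # YYYYMMDD
        yield f"{MONTHS[day.month - 1]} {day.day}, {day.year}"  # long form
        day += timedelta(days=1)

november = set(date_candidates(date(2011, 11, 1), date(2011, 11, 30)))
print(len(november))
```

Other languages and separators could be added as further yield lines, at the cost of a larger dictionary.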

Spotting Password Singularities While Watching Numbers

It is also possible to represent 8-digit numeric passwords in a heat map, as shown below. The x-axis represents the first four digits from 0000 to 9999, and the y-axis represents the last four digits from 0000 to 9999. To keep the graphic to a reasonable size, we show a 200 x 200 pixel heat map, where each column represents a “window” of 50 prefixes and each row represents a “window” of 50 suffixes. Each pixel of the heatmap then represents 2500 possible passwords.

Eight Digit Password Heat Map with Highlighted Regions

Again, the obvious diagonal pattern corresponds to passwords formed by repeating the same four digits twice: 42154215, 36713671, etc. This is also a good pattern for cracking alphanumeric passwords: more than 19,000 passwords, e.g. abc1abc1, follow this pattern in the LinkedIn data.

The yellow area displays every password of the format DDMMYYYY for each year from 1900 to 1999; the small green area shows every password of the format YYYYMMDD. These two little areas contain 25% of the 8-digit numeric passwords.

As I was looking at this heat map, I noticed an abnormal line in the upper right corner (the little pink box on the graphic). At first, I thought it was a 69xxxxxx pattern (users’ affinity for the number 69 is pointed out in a paper on customer-chosen banking PINs), but it wasn’t. After a closer look at the actual values from the cracked passwords, I found a strange, nearly complete sequential list of 423 numbers spanning 67108865 to 67108899 and 67109000 to 67109397. At the time of writing this post, I’m waiting for confirmation from LinkedIn on whether the accounts related to this range were automatically generated in some way.
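One way to surface such a sequence without a heat map is to look for long runs of consecutive values among the cracked numeric passwords. The sketch below uses synthetic data: it fills both ranges completely (35 + 398 = 433 values), whereas the real data contained 423 of them, and find_runs with its thresholds is a hypothetical helper.

```python
def find_runs(values, max_gap=1, min_length=20):
    """Group integers into runs whose sorted neighbours differ by at most max_gap.

    Returns (first, last, count) for every run with at least min_length members.
    """
    runs, current = [], []
    for value in sorted(set(values)):
        if current and value - current[-1] > max_gap:
            if len(current) >= min_length:
                runs.append((current[0], current[-1], len(current)))
            current = []
        current.append(value)
    if len(current) >= min_length:
        runs.append((current[0], current[-1], len(current)))
    return runs

# Synthetic stand-in for the cracked 8-digit passwords (not the real data).
sample = (list(range(67108865, 67108900))
          + list(range(67109000, 67109398))
          + [12345678, 19841984, 42154215])
for first, last, count in find_runs(sample):
    print(first, last, count)
```

With real, slightly incomplete data, raising max_gap to 2 or 3 would keep a run from being split at each missing value.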

Again, by focusing on the results of an easy incremental attack on the passwords, we can discover various strategies (word repetitions, date generation, etc.) that can be reused to crack tens of thousands more passwords, including strong ones, without using rainbow tables or a powerful CPU/GPU. At the same time, we find that passwords sometimes reveal secrets (apart from the pure demographic analysis that could have been done on the birth dates of LinkedIn users), like that bizarrely long list of numerical passwords.

New Questions

The obvious conclusion, which has been repeated many times elsewhere, is that humans are not good at generating secure passwords. No matter how clever we think we are, we always pick passwords based on some reasoning, like the dates or repetition patterns shown above, and that reasoning can be discerned and used to crack the password. Only truly random passwords are safe from this type of method.

Personally, I’m interested in the person(s) who stole and released the LinkedIn data. Hypothetically, if we are able to extract information from the released password hashes, and if the hashes that have been “marked” with five leading 0’s are those already cracked by the initial hacker, I wonder whether someone could build a profile of this person by looking at the success rate of different cracking methods such as rainbow tables, incremental mode, dictionary attacks and so forth.

Lessons Learned from Cracking 2 Million LinkedIn Passwords

Like everyone this week, I learned about a huge file of password hashes that had been leaked by hackers. The 120MB zip file contained 6,458,020 SHA-1 hashes of passwords for end-user accounts.

At first, everyone was talking about a quick way to check if their password had been leaked. This simple Linux command line:

echo -n MyPassword | shasum | cut -c6-40

allows the user to create a SHA-1 sum of his password and take the 6th through 40th characters of the result. (See note below*). Then the user could easily search the 120MB file to see if his hash was present in the file. If it was, then of course his password had been leaked and his account associated with that password was at risk.
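The same check can be sketched in Python. This is a minimal illustration, not an official tool; the function names are hypothetical, and it compares only characters 6 through 40 so that hashes whose first five characters were overwritten with 0 still match.

```python
import hashlib

def sha1_tail(password):
    """Characters 6-40 of the hex SHA-1 digest, like `shasum | cut -c6-40`."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()[5:]

def is_leaked(password, hash_lines):
    """True if any 40-character hex hash in hash_lines ends with the same tail."""
    tail = sha1_tail(password)
    return any(line.strip().lower()[5:] == tail for line in hash_lines)

# Two in-memory lines standing in for the leaked file: one intact hash and
# one with its first five characters replaced by zeros.
lines = [
    hashlib.sha1(b"MyPassword").hexdigest(),
    "00000" + sha1_tail("password"),
]
print(is_leaked("MyPassword", lines), is_leaked("password", lines))
```

In practice hash_lines would be an open file object iterated line by line, so the 120MB file never has to be loaded into memory at once.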

John the Ripper

But when the OpenWall community released a patch to run John the Ripper on the leaked file, it caught my attention. It had been a long time since I had run John the Ripper, so I decided to install this new, community-enhanced "jumbo" version and apply the LinkedIn patch.

John the Ripper attempts to crack SHA-1 hashes of passwords by iterating on this process: 1. guess a password, 2. generate its SHA-1 hash, and 3. check if the generated hash matches a hash in the 120MB file. When it finds a match, it knows it has a legitimate password. John the Ripper iterates in a very smart way, using word files (a.k.a. dictionary attack) and rules for word modifications, to make good guesses. It also has an incremental mode that can try every possible password (allowing you to define the set of passwords based on the length or the nature of the password, with numeric, uppercase, or special characters), but this becomes very compute-intensive for long passwords and large character sets.

The fact that the hashed passwords in the file were not salted helps a lot. As an aside, even if they had been salted, you could concentrate the cracking session on the easiest passwords first using the "single" mode of John the Ripper. But this works best with additional user information like a GECOS, which was not available in this case, at least to the public. So the difficulty would be much greater for salted hashes.
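The guess, hash, compare loop can be sketched in a few lines. This is a toy illustration of the idea only: the mangle rules below are made up for the example and are not John the Ripper's rule set, and the targets are assumed to be full, unsalted SHA-1 hex digests.

```python
import hashlib

def sha1_hex(word):
    return hashlib.sha1(word.encode("utf-8")).hexdigest()

def mangle(word):
    """A tiny, illustrative rule set applied to each dictionary word."""
    yield word
    yield word + "1"          # append 1
    yield "1" + word          # prepend 1
    yield word.capitalize()   # capitalize the first letter
    yield word[::-1]          # reverse the word

def crack(wordlist, target_hashes):
    """Return {hash: password} for every target matched by a mangled word."""
    targets = {h.lower() for h in target_hashes}   # set gives O(1) lookups
    found = {}
    for word in wordlist:
        for guess in mangle(word):          # 1. guess a password
            digest = sha1_hex(guess)        # 2. generate its SHA-1 hash
            if digest in targets:           # 3. compare with the leaked hashes
                found[digest] = guess
    return found

targets = [sha1_hex("secret1"), sha1_hex("nimda"), sha1_hex("Zq#8vLp2")]
print(sorted(crack(["secret", "admin"], targets).values()))
```

Because the hashes are unsalted, each guess is hashed once and checked against every target at the same time, which is why the lack of salt helps so much.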

Approach

In my case, I have an old machine with no GPU and no rainbow table, so I decided to use good old dictionaries and rules.

I ran the default john command, which first applies a small set of rules (like append/prepend 1 to every word, etc.) to a small default password dictionary of fewer than 4,000 words. It then switches to incremental mode based on statistical analysis of known password structures, which helps it try the more likely passwords first. The result was quite impressive: after 4 hours I had already cracked approximately 900K passwords.

But then, as it got to the point where it was trying less and less likely passwords and therefore found matches more slowly, I decided to stop it and run a series of old dictionaries I had: from default common password lists (16KB of data) to words of every existing language (40MB of data). It was very efficient and found 500K more passwords in less than an hour, for a total of 1.4M passwords.

Even though my dictionaries were 10 years old and didn’t contain newer words like "linkedin", it appeared that some cracking rules, by reversing strings or removing some vowels, could derive new slang words from already cracked passwords.

And as I had just acquired 1.4M valid passwords, I believed that using these newly discovered passwords as a dictionary I could find more. It worked and the rules applied to the already cracked passwords produced 550K new ones. I ran a second iteration using the 550K passwords from the first iteration as a dictionary, and found 22K more. I iterated in this manner a total of ten times.
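This feedback loop can be sketched as follows: each round uses only the passwords found in the previous round as its dictionary and stops when a round finds nothing new. The rules and the function name iterate_cracking are hypothetical stand-ins; John the Ripper's real rule engine is far richer.

```python
import hashlib

def sha1_hex(word):
    return hashlib.sha1(word.encode("utf-8")).hexdigest()

def iterate_cracking(seed_words, target_hashes, rules, max_rounds=10):
    """Return one set of newly cracked passwords per round."""
    remaining = set(target_hashes)
    dictionary = set(seed_words)
    rounds = []
    for _ in range(max_rounds):
        newly_found = set()
        for word in dictionary:
            for rule in rules:
                guess = rule(word)
                if sha1_hex(guess) in remaining:
                    newly_found.add(guess)
        if not newly_found:
            break                      # nothing new: the process has converged
        remaining -= {sha1_hex(p) for p in newly_found}
        rounds.append(newly_found)
        dictionary = newly_found       # next round starts from this round's finds
    return rounds

# Toy rules and targets, for illustration only.
rules = [lambda w: w + "ed", lambda w: w + "in"]
targets = {sha1_hex("linked"), sha1_hex("linkedin")}
print(iterate_cracking(["link"], targets, rules))
```

Here "linkedin" is out of reach of any single rule applied to "link", but it falls in the second round once "linked" has been recovered.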

It is interesting to see that the most elaborate passwords found in the 3rd or 4th iteration of this kind of recursive dictionary cracking were related to the word linkedin most of the time.

Matching the word linkedin, including slightly modified forms (reversed, or with '1' or '!' in place of 'i', as in l1nked1n), against the passwords found in each iteration gives:

  • In the first iteration, 558 of the 554,404 passwords found (0.1%) are related to the ‘Linkedin’ string;
  • In the second iteration, 3,248 out of 22,688 (14%) are related to the ‘Linkedin’ string;
  • Third iteration: 1,733 out of 3,682 (47%);
  • Fourth iteration: 539 out of 917 (59%);
  • Fifth iteration: 217 out of 330 (66%);
  • Sixth iteration: 119 out of 152 (78%);
  • Seventh iteration: 40 out of 51 (78%);
  • And so on through the tenth iteration.
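A matcher for such variants, together with the percentages listed above, can be sketched as follows. The regular expression is an assumption about what counts as "related" (forward or reversed, with 1 or ! allowed for i); the exact matching criteria used for the figures above are not given, while the counts themselves are the ones from the list.

```python
import re

# 'i' may appear as i, 1 or !; case is ignored. Illustrative pattern only.
LINKEDIN = re.compile(r"l[i1!]nked[i1!]n", re.IGNORECASE)

def is_linkedin_related(password):
    """True if the password contains a linkedin variant, forward or reversed."""
    return bool(LINKEDIN.search(password) or LINKEDIN.search(password[::-1]))

# (related, total) per iteration, as listed above.
ITERATIONS = [(558, 554404), (3248, 22688), (1733, 3682), (539, 917),
              (217, 330), (119, 152), (40, 51)]

for number, (related, total) in enumerate(ITERATIONS, start=1):
    print(f"iteration {number}: {100 * related / total:.1f}%")
```

The share climbs steadily because generic passwords are exhausted in the early rounds, leaving mostly site-specific constructions for the later ones.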

An example of what I found on the 7th pass is:  m0c.nideknil

Another example is: lsw4linkedin, which was found on the tenth pass. To illustrate how the rules work for modifying words in the dictionary, below is the actual set of modifications used to get from the dictionary entry 'pwlink' to the successfully cracked password 'lsw4linkedin' over the ten iterations:

  1. pwdlink from pwlink with the rule "insert d in 3rd position"
  2. pwd4link from pwdlink with the rule "insert 4 in 4th position"
  3. pwd4linked from pwd4link with the rule "append ed"
  4. pw4linked from pwd4linked with the rule "remove 3rd char"
  5. pw4linkedin from pw4linked with the rule "append in"
  6. mpw4linkedin from pw4linkedin with the rule "prepend m"
  7. mw4linkedin from mpw4linkedin with the rule "remove second character"
  8. smw4linkedin from mw4linkedin with the rule "prepend s"
  9. sw4linkedin from smw4linkedin with the rule "remove second character"
  10. lsw4linkedin from sw4linkedin with the rule "prepend l"

This is the deepest password found, i.e. the only one obtained in the last iteration.
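The ten steps listed above can be replayed mechanically. The helpers below (insert_at, remove_at, append, prepend) are hypothetical Python stand-ins written for this example, with 1-based positions as in the list; they are not John the Ripper's rule syntax.

```python
def insert_at(position, text):
    """Insert text so that it starts at the given 1-based position."""
    return lambda word: word[:position - 1] + text + word[position - 1:]

def remove_at(position):
    """Remove the character at the given 1-based position."""
    return lambda word: word[:position - 1] + word[position:]

def append(text):
    return lambda word: word + text

def prepend(text):
    return lambda word: text + word

CHAIN = [
    insert_at(3, "d"), insert_at(4, "4"), append("ed"), remove_at(3),
    append("in"), prepend("m"), remove_at(2), prepend("s"),
    remove_at(2), prepend("l"),
]

def replay(word, chain):
    """Apply each rule in turn and return every intermediate password."""
    steps = []
    for rule in chain:
        word = rule(word)
        steps.append(word)
    return steps

for number, password in enumerate(replay("pwlink", CHAIN), start=1):
    print(number, password)
```

Each intermediate string had to be a real password in the leak for the chain to continue, which is what makes the ten-step path possible at all.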

This clearly shows that no matter how elaborate a password you choose, as long as it is based on words and rules, even if there are many words and many rules, it will probably be cracked. The fact is that on a huge file like the LinkedIn leak, every password you find can help you to get another one. That is because human-created passwords are not random, and programs like John the Ripper and dictionary attacks can use patterns, either already known or discovered in the password hash file, to greatly reduce the time needed to crack them.

Password Management

Thus, it is highly recommended to use a strong random password generator that is known to be actually random.
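As one possible illustration, Python's standard secrets module draws from the operating system's cryptographically secure random source. The alphabet below, including the particular 10 special characters, and the default length of 16 are assumptions made for this sketch.

```python
import secrets
import string

# 26 lower-case + 26 upper-case + 10 digits + 10 special characters = 72 symbols.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*()"

def random_password(length=16):
    """Build a password by picking each character independently at random."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(random_password())
```

Because every character is chosen independently and uniformly, there is no word, date, or rule for a dictionary attack to latch onto.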

It is funny to note that a very old version of a command line tool called "mkpasswd" relied on a poorly seeded random number generator and could produce only 32,768 different passwords (http://www.kb.cert.org/vuls/id/527736). This was reported and fixed 10 years ago, but I was still able to recover 140 passwords in the leaked file that had been generated by this vulnerable version of mkpasswd.

Evidence indicates that the hacker who made this leak public was most likely trying to get cracked passwords from an online community, a kind of crowdsource cracking. Since he probably possesses the list of logins as well, you might want to change your passwords in other accounts if you think he can access them with the information he has. Note that if you have unique passwords created with simple rules, you might change them as well. For example, if your password for LinkedIn is MyPW4Linkedin, a malicious cracker might guess that MyPW4Facebook might be your Facebook password.

It is also recommended to change your password if your username can be guessed from it, because every password cracker on the planet is currently playing with this password file.

The author of John the Ripper, Solar Designer, did a great presentation on the past, present and future of password security. Although the security industry has put a lot of work into making good hash functions (and there’s still more work to do), I believe that poorly chosen passwords are a concern. Maybe we should demand that our browsers (using secured storage as in Firefox Manager) or 3rd-party single-sign-on providers create easier solutions to help us resist the temptation of using simple passwords and re-using the same passwords with simple variations.

* Note: Some of the hashes in the 120MB file had their first five characters overwritten with 0. If we look at the 6th to 40th characters, we can even find duplicates of these substrings in the file, meaning the first five characters have been used for some unknown purpose. Is it LinkedIn that stores user information there? Is it the initial attacker who tagged a set of accounts to compromise? This is unknown.