
Introduction to Password Storage

Author: Jeremy Choo
Reviewers: Amrut Prabhu, Marvin Chin, Tan Zhen Yong, Wang Junming

Overview

Many software applications use a username and password combination as user account credentials for authentication. Obviously, it is not a good idea for the software to store these credentials as plain text because if someone else were to gain access to them either lawfully (e.g., an employee who has access to the data) or unlawfully (e.g., someone hacking into the data storage), that person can use those credentials directly to impersonate the account owner. This article explains some techniques that can be used to store user credentials more securely:

Encryption
Hashing
Salting

Plaintext is a cryptography term generally referring to text before encryption or after decryption. Another term for it is cleartext.

Encryption

Encryption is the process of converting plaintext into ciphertext using an encryption key. To read the message, a decryption key is required to convert the ciphertext back into its original plaintext. This process is called decryption. Without the decryption key, the ciphertext is simply a bunch of meaningless data. There are two main types of encryption algorithms: symmetric key algorithms, where the encryption and decryption keys are identical or closely related, and asymmetric key algorithms, where the encryption and decryption keys are different.

Ciphertext is a cryptography term generally referring to data after it has been encrypted.

Without this decryption key, decryption cannot be performed. Only the intended recipient of the data should have the decryption key.

One common example of encryption is shifting each letter of the alphabet to the left or right by a fixed number of positions. This is known as the Caesar cipher. For example, if we chose to shift all the letters 3 positions to the right, then the encryption key (and decryption key) for this algorithm would be 3. This would result in the following substitution:

Plaintext:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ciphertext: DEFGHIJKLMNOPQRSTUVWXYZABC

This would mean that encrypting the message "I love you" would result in "L oryh brx".
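The shift described above can be sketched in a few lines of Python. The function names here are made up for illustration, and the sketch assumes only unaccented English letters are shifted while everything else (spaces, punctuation) passes through unchanged.

```python
import string


def caesar_shift(text, key):
    """Shift each English letter `key` positions to the right, wrapping after Z."""
    shifted = []
    for ch in text:
        if ch in string.ascii_uppercase:
            shifted.append(chr((ord(ch) - ord("A") + key) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + key) % 26 + ord("a")))
        else:
            # Spaces, digits and punctuation are left as they are.
            shifted.append(ch)
    return "".join(shifted)


def caesar_encrypt(plaintext, key):
    return caesar_shift(plaintext, key)


def caesar_decrypt(ciphertext, key):
    # Decrypting is the same shift in the opposite direction.
    return caesar_shift(ciphertext, -key)


print(caesar_encrypt("I love you", 3))  # L oryh brx
print(caesar_decrypt("L oryh brx", 3))  # I love you
```

Because the same number is used in both directions, this is a small example of a symmetric key algorithm.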

Naturally, this isn't a very good encryption method because even if one doesn't know the decryption key, the method can easily be brute forced by trying all 25 possible shifts and seeing if any of them results in a readable message.
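To see how little work that brute force attack involves, here is a minimal sketch that tries every possible shift on a ciphertext. The helper names are hypothetical, and a person (rather than the program) is assumed to pick out the readable candidate.

```python
def shift_back(ciphertext, key):
    """Undo a Caesar shift of `key` positions on English letters."""
    result = []
    for ch in ciphertext:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base - key) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)


def brute_force(ciphertext):
    """Return every candidate plaintext, keyed by the shift that produced it."""
    return {key: shift_back(ciphertext, key) for key in range(1, 26)}


candidates = brute_force("L oryh brx")
for key, guess in candidates.items():
    print(key, guess)
```

Only 25 candidates are printed, and the one for key 3 reads "I love you", so the attacker recovers both the message and the key almost instantly.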

A brute force attack tries every possible key or password in turn. When the number of possibilities is large, it usually takes very long to carry out.

Another encryption method is the Pigpen Cipher, where letters are substituted with symbols. The encryption key usually looks something like this:

In this case, each letter is substituted with a symbol that matches the exterior walls of where that letter is. For example, the letter W would be encrypted to the symbol [image: Pigpen symbol for W]. If the letter is located on the right side instead, then a dot is placed in the middle of the symbol to indicate that it refers to the right letter instead of the one on the left. For example, the letter P would be encrypted to the symbol [image: Pigpen symbol for P].

Naturally, the decryption key would be the encryption key itself, as it can be re-used to decrypt the ciphertext. Compared to the Caesar cipher, the Pigpen cipher is more resilient to brute force attacks if one doesn't know the decryption key, as the attacker may have to try all possible substitutions for each symbol.

I had a secret agent send me information about the secret ingredient of Mick's cheeseburgers earlier, as they were delicious and I found myself constantly eating it. I suspect it's some addictive substance to make customers keep coming back for more. To ensure that Mick didn't know their secret ingredient was being leaked, I had my agent send it in Pigpen cipher:

Encryption might seem like a good idea because the ciphertext is meaningless without the decryption key, which prevents all of the problems with storing the data directly in plaintext. However, because encryption is reversible, it is always possible to regain the original password from the ciphertext. Since the password is encrypted, the decryption key must also be stored somewhere. This means that if someone manages to hack into the application and read the encrypted passwords, it is also likely that they will be able to read the decryption key. With the decryption key, they will be able to decrypt all the passwords and read them anyway. This makes encryption unsuitable for password storage.

The encryption key is usually a randomly generated value that is used to encrypt the original data.

Hashing

Hashing uses a one-way function to transform a set of data into another set of data, called the hash. Unlike encryption, when hashing is done, information that describes the original set of data is lost irrevocably in the process. This means that the original input cannot be computed back from the hash.

Different hashing algorithms will produce different outputs for the same input.

Some examples of cryptographic hashing algorithms are MD5 and SHA1. However, when you hash passwords, you should use a password hashing algorithm (such as Argon2, SCrypt or bcrypt). The main difference is that password hashing algorithms are deliberately slow, which makes brute force attacks far more expensive, whereas general-purpose cryptographic hashing algorithms are built for speed. That said, if you plan on learning more about hashing algorithms, we recommend starting with MD5 and SHA1, as they are easier algorithms to learn about.
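The speed difference can be observed with a rough timing sketch. It is only an illustration under a few assumptions: SHA-256 stands in for a fast general-purpose hash, and PBKDF2 from Python's standard library stands in for a slow password hashing algorithm, since Argon2 and bcrypt need third-party packages. The iteration count and the fixed salt value are arbitrary choices for the demo, not recommendations.

```python
import hashlib
import time

password = b"correct horse battery staple"
demo_salt = b"illustrative-salt"  # PBKDF2 requires a salt; fixed here only for the demo
ITERATIONS = 200_000              # arbitrary illustrative work factor

# One pass of a fast, general-purpose hash.
start = time.perf_counter()
hashlib.sha256(password).digest()
fast_seconds = time.perf_counter() - start

# A deliberately slow derivation that repeats the underlying hash many times.
start = time.perf_counter()
hashlib.pbkdf2_hmac("sha256", password, demo_salt, ITERATIONS)
slow_seconds = time.perf_counter() - start

print(f"single SHA-256: {fast_seconds:.6f} s")
print(f"PBKDF2 x {ITERATIONS}: {slow_seconds:.6f} s")
```

The exact numbers depend on the machine, but the slow derivation should take orders of magnitude longer per guess, and an attacker has to pay that cost for every password they try.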

For example, a simple hashing algorithm that acts on numbers could add up all the digits in that number. This would mean hashing the number 1013 would result in 1+0+1+3 = 5. Hashing the number 761 would result in 7+6+1 = 14. Note that after hashing the number, there is no way to recover the original number: data about the original number (such as the number and position of its digits) has been irrevocably lost in the process. Additionally, many different numbers could result in the same hash. For example, the numbers 101 and 20 both result in the hash of 2. This is called a hash collision. A good hashing algorithm attempts to minimize the number of hash collisions such that the probability of one happening is close to 0. MD5, for instance, produces a 128-bit hash, so if its outputs were evenly distributed, the probability that two arbitrary inputs collide would be 1 in 2¹²⁸, which is 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456. In practice, MD5 collisions can be constructed deliberately, which is one reason it is no longer considered secure.
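The toy digit-sum hash from this example is small enough to write out. This is a sketch for illustration only (the function name is made up), and it is of course not a usable hashing algorithm.

```python
def digit_sum_hash(number):
    """Toy hash: add up the decimal digits of a non-negative integer."""
    return sum(int(digit) for digit in str(number))


print(digit_sum_hash(1013))  # 5
print(digit_sum_hash(761))   # 14

# Two different inputs with the same hash: a hash collision.
print(digit_sum_hash(101) == digit_sum_hash(20))  # True

# The number of possible 128-bit hash values, for comparison.
print(f"{2 ** 128:,}")
```

Given only the output 5, there is no way to tell whether the input was 1013, 5, 50, 23 or any other number whose digits add up to 5, which is what "one-way" means here.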

Why isn't hashing enough?

A rainbow table is a precomputed table of hashes for some set of passwords. Basically, people build huge tables of hashes wherein the plaintext is already known, so that attempting to crack hashes becomes a simple problem of looking up the hash in the table and its corresponding value, instead of attempting to reverse the hash. Through this method, it is very easy to crack simple hashes by simply doing a lookup.
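The lookup idea can be sketched with a plain dictionary. Two caveats: real rainbow tables use chains of hashes to save storage space, which this sketch does not attempt, and unsalted SHA-256 is used here purely as a stand-in for whichever fast hash a careless application might have chosen. The password list is a tiny made-up sample.

```python
import hashlib


def unsalted_hash(password):
    """Hash a password with no salt (the weakness being demonstrated)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# The attacker computes this once, ahead of time, and reuses it for every leak.
common_passwords = ["123456", "password", "qwerty", "letmein"]
lookup_table = {unsalted_hash(p): p for p in common_passwords}

# A hash found in a leaked database.
leaked_hash = unsalted_hash("letmein")

# "Cracking" is a single dictionary lookup; nothing is reversed.
print(lookup_table.get(leaked_hash))  # letmein
```

A password that is not in the table simply yields no match, which is why such tables are built from very large lists of likely passwords.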

An example of a service that provides this is Crackstation.

Since attacks like rainbow tables exist, passwords need another layer of security.

Salting

Salting refers to appending a string of text, unique to each user, to their password before hashing it. Since each user has a unique salt, rainbow tables become ineffective: the precomputed hashes were built from passwords without the salt, so they will not match the salted hashes. The salt forces the attacker to recompute the table for each password, which effectively turns the attack into brute force, as a hash must be computed for each possible password. Additionally, any table computed this way would only be useful against that specific user, as every other user has a different salt. With a slow password hashing algorithm and a strong password, cracking a single password this way could take years.
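A small sketch shows the effect of a per-user salt. SHA-256 is used only to keep the example short (it is not a password hashing algorithm), the 16-byte salt length is an assumption, and the names are invented for illustration.

```python
import hashlib
import secrets


def salted_hash(password, salt):
    """Append the salt to the password, then hash the result."""
    return hashlib.sha256(password.encode("utf-8") + salt).hexdigest()


# A table the attacker precomputed from unsalted common passwords.
precomputed = {
    hashlib.sha256(p.encode("utf-8")).hexdigest(): p
    for p in ["123456", "password", "qwerty"]
}

# Two users who happen to choose the same weak password.
alice_salt = secrets.token_bytes(16)
bob_salt = secrets.token_bytes(16)
alice_hash = salted_hash("password", alice_salt)
bob_hash = salted_hash("password", bob_salt)

print(alice_hash == bob_hash)     # the stored hashes differ
print(alice_hash in precomputed)  # the precomputed table no longer matches
```

Even though both users chose "password", the stored hashes are different from each other and from anything in the precomputed table, so the attacker has to start over for each user.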

Note that the salt should be randomly generated, rather than derived from a predictable value that merely differs between users. For example, if an application used the username of the user as the salt, then attackers could pre-generate rainbow tables for common usernames, leaving users with common usernames vulnerable to a rainbow table attack, even with salt applied to their passwords.

Thus, the way to store passwords properly is to use a salted hash: take the user's password, append the salt to it, and hash the result. That hash is the user's hashed password. When the user attempts to log in again, perform the same procedure with the stored salt and the password they entered. If the hashes match, then you know that the user is who they say they are, even though you never stored the original password.
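The sign-up and log-in procedure could look roughly like the sketch below. It makes several assumptions. PBKDF2 from Python's standard library is swapped in as the slow hash so the example runs without third-party packages, and it mixes in the salt internally rather than by literal appending. The iteration count and salt length are illustrative. The function names are hypothetical. In a real application, a maintained password hashing library is the safer choice.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # illustrative work factor, not a recommendation


def hash_password(password):
    """Sign-up: generate a random salt and return (salt, digest) for storage."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Log-in: repeat the same procedure with the stored salt and compare."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(candidate, expected_digest)


stored_salt, stored_digest = hash_password("hunter2")
print(verify_password("hunter2", stored_salt, stored_digest))  # True
print(verify_password("hunter3", stored_salt, stored_digest))  # False
```

Only the salt and the digest are stored. The password itself exists in memory just long enough to be hashed.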

One question that is commonly asked by developers is where to store the salt. The salt can be stored in plaintext, along with the user in the database. Since the goal of the salt is only to prevent precomputed rainbow tables from being used, it doesn't need to be encrypted in the database.

Many good password hashing algorithms today have built-in salts, such as Argon2, SCrypt and bcrypt. A good password hashing algorithm or library will salt automatically.

What if there is a server breach?

A common question asked by developers is how much all of these security measures actually matter. After all, if an attacker has gained access to the entire application, does it matter if passwords are stored in plaintext or not?

If an attacker has already gained access to the entire application, then they already have all the information that they could possibly want from the server. They would have access to all of the application's data, including data from users or from any analytics software that might be running. However, if the passwords are salted and hashed, the attacker still doesn't know the customers' passwords and could take years to find them out. If the passwords were stored in plaintext instead, since 59% of people use the same password across multiple sites ^source, the attacker could quickly try those credentials on other websites such as banks to attempt to break into those accounts, which can potentially yield great returns in terms of information and/or money.

Additionally, when a user signs up on your website and provides you a password, they implicitly trust you to keep that information safe and secure for them. In a sense, you have a responsibility to keep their passwords secure. By storing passwords properly, if your servers ever get breached, you can assure your customers that their passwords are properly secured, and still maintain some of their trust in you.

Getting started

When implementing password storage yourself, it is far too easy to make a mistake. Instead, use one of the free libraries that provide a crypto function that has already been well-tested by the community. Do not write your own crypto library.

Here are some libraries you can use to implement password storage:

PHP: password_hash
Java: SecretKeyFactory
Python: PassLib
JavaScript: bcrypt

Other resources

Secure Salted Password Hashing - How to do it properly is an excellent resource that explains how to perform salted password hashing correctly, including links to other good libraries and what else can be done.
Awesome Cryptography is a curated list of resources on cryptographic algorithms - articles, blogs, books, libraries and more.