Consistent Hashing explained — A detailed insight

In recent years, almost all the software applications have moved to distributed architecture to support increasing load and traffic by leveraging concepts such as cloud computing and big data.

In this blog, I will be discussing Consistent Hashing which is used in building scalable distributed systems. Consistent Hashing makes distributed systems elastically scalable.

Before jumping straight into Consistent Hashing, lets quickly revisit hashing and then try to understand how the need for consistent hashing emerged.

What is Hashing?

Hashing is the process of converting a given piece of data of arbitrary size into another value of fixed size, usually a integer, called hash value or just hash. In order to generate this hash value, a special function called hash function is used, which computes the new hash value based on some mathematical algorithm. A good hash function uses a one-way hashing algorithm which ensures that the hash cannot be converted back into the original key.

One way hash functionSource

For example, hashing can be used to convert strings of arbitrary size to some fixed integer value between 0 to 100. Given any string it will always try to map it to any integer between 0 to 100.

Hello World  --> 30
Test Message --> 59

The hash computed through hash function is used for storing the objects in a Hash Table. Hash Table is a data structure that helps in constant time lookups. In a hash table, data is stored in an array format, where each data value has its own unique index value. In order to store objects in a Hash Table, we use one of the fields of the object as a key and compute hash over this key.

Modular Hashing

The range of integers output from hash function is usually much larger than the size of array in hash table. In order to store all the objects within the array range of hash table, we perform a modulo operation over the hash computed to find the index where the object is stored.

hash = hash_func(object->key)
index = hash % N where N is the size of array.

Using modulo operation guarantees that indexes of all the objects will be in range 0 to array_size-1. This is known as Modular Hashing.

Distributed Hashing and its challenges

In distributed systems, resources such as CPU, Database, Cache, etc. are spread across multiple servers. This allows to overcome the memory limitations of using a single computer. In distributed systems, in order to allocate a resource to a request, we first need to select a server from which the resource will be allocated. This is where distributed hashing comes into picture. Let us take an example to understand this better.

Consider a School management system used to store information of it students. Each student can have the following entity structure:

struct Student {
int roll_no;
string Name;
string Address;
};

Since roll_no is unique for each student, it can be used as a key for Student object here.

We will mainly be performing the following two operations:

  1. Insert information of a new Student into database upon admission.
  2. Fetch one or more student details using its roll_no.

As a school will have to maintain data of large number of students, it will be difficult and less reliable to store all the data into one database server. To tackle this, the data is distributed across multiple servers. Now lets understand how the above two operations will function with multiple servers in place.

  1. Insert a New Student information: In order to insert information of a new student, we can use modular hashing to decide the server at which to store the data of new student.
hash = hash_function(student->roll_no)
index = hash % N where N is the number of servers.

Lets assume we have 3 servers S0, S1 and S2 available.

Roll No.                   Hash         Server(Hash%3)123456                      85             1216801                      26             2827510                      33             0253814                      47             2712952                      71             1531245                      12             0

2. Fetch Student data: Similar to the above approach, in order to find the server at which data is stored, we can use modular hashing and perform modulo N over the hash value computed.

Challenges with the above approach

Though using multiple servers to store data makes the system much more reliable, however in a distributed system, the number of server is not constant. A server can come and go at any point in time. We need to ensure to take into account such scenarios as well.

Case I — Server removed : Suppose one of the server S2 crashed due to some reason. Now only S1 and S3 are available. In that case, if we try fetching data for a student with hash value 26, then the server number will be S0 (26%2 = 0). However, the data was actually stored on server S2 (26%3 = 2). This will result in a miss. Thus in order to solve this we will have to perform Rehashing of all the data.

Case II — Server Added : Suppose we added a new S3 into the system. Now we have total 4 servers (S0, S1, S2 and S3). In that case, if we try fetching data for a student with hash value 33, then the server number will be S1 (33%4 = 1). However, the data was actually stored on server S0 (33%3 = 0). This will again result in a miss. Similarly, in order to solve this we will have to perform Rehashing of all the data and move the objects to new computed servers. This will lead to huge data migration problem.

Rehashing of huge amounts of data every time number of servers change is a very big problem and will degrade the performance of the system severely.

Lets see how Consistent Hashing can help us solve this problem.

Consistent Hashing

Consistent hashing is a hashing technique that performs really well when operated in a dynamic environment where the distributed system scales up and scales down frequently. Consistent hashing works independently of the number of servers or objects in a distributed hash table. Without a fixed number of servers, scaling up and down is a lot more efficient as it would minimize number of relocation of keys and prevent potential downtime or performance issues.

In consistent hashing, the entire range of hash values are mapped onto an abstract circle called hash ring. That means that the minimum possible hash value, zero, would correspond to an angle of zero and the maximum possible value of hash function would correspond to an angle of 360 degrees, and all other hash values would fit somewhere in between.

The hash function outputs a position on this ring. We compute position of keys of both the objects as well as servers(using their IP address or other ID of server as key) and place them accordingly on the hash ring. Thus both objects and servers share the same space in consistent hashing.

Now in order to locate an object, we first compute the position of the object’s key using the hash function. Once the position is computed, we locate the key on the hash ring and move in clockwise direction until we find a server. The object is located on the first server encountered while moving in clockwise direction.

Mapping Keys to Servers with Consistent Hashing

In the image above, we have 4 server(S1-S4) and 6 object keys(K1-K6), mapped onto the hash ring. In order to find the position of the objects/server on the ring, we use a hash function that returns the position. This can be a simple hash function that computes modulo 360 over the hash value of key.

Lets see how Consistent hashing solves the problem of Rehashing upon server removal/addition.

Case I — Server removed : With consistent hashing, when a server is removed, only the keys from that server needs to be relocated. For e.g, in the image above, if S1 crashes, then we need to only relocate key K6 to Server S2. However, there is need to relocate any other Key. This has a drastic improvement over modular hashing where we has to rehash and migrate all of the object keys.

After removing server S1

Case II — Server Added : If we add a new server S5 between S3 and S4, then we only need to relocate the Keys between S3 and S5. As shown below, only key K4 is relocated to server S5. There is no relocation of any other Key.

After adding server S5

As we see, using consistent hashing minimizes the problem of data migration by a ton upon server removal and addition.

Problem with Pure Consistent Hashing

Even though using pure consistent hashing minimizes the problem of data migration of all the keys, it suffers from the problem of non-uniform distribution of keys/data. This happens due to constant up-scaling and down scaling of the servers. For e.g, lets revisit Case I — where server S1 was removed.

When server S1 gets removed

As shown above, upon removal of server S1, keys K6, K1 and K2 are mapped to Server S2. This is not ideal scenario as 50% of keys is served by server S2 now.

In order to solve this issue, we use virtual servers with consistent hashing.

Virtual Servers

In pure consistent hashing, we were placing each server only at a fixed location in the hash ring which caused the non-uniform distribution of keys. In order to make sure the data/keys are uniformly distributed across all servers, we assign multiple locations to each server on the hash ring. These locations on the hash ring are called virtual servers. This can be done by using multiple hash functions instead of one for each server.

For e.g, for assigning location to servers on hash ring, we pass them through three hash function h1,h2 and h3. In that case, we will get three different locations for each server.

Using virtual servers with consistent hashing

This will solve the problem of non uniform distribution of data/keys on server removal/addition. As shown in the image above, if server S1 crashes now, its keys get re-mapped to S4, S2 and S3. Similarly if server S3 crashes, its keys get re-mapped to S4, S1 and S2.This will result in uniform increase in data at all the other servers and will not overload one server.

Similar to removal of server, upon addition of a new server, there will be uniform reduction in data handled by each server. We can have any number of virtual servers for each server. With N hash functions, there will be N virtual server for each server. If N is large enough, it will lead to uniform distribution of data at all times across all servers.

In general, only k/N keys need to be remapped when k is the number of keys and N is the number of servers.

Summary

Consistent hashing is used throughout distributed systems for solving multiple problems. It is used in Web Caches, Load Balancers and multiple distributed storage systems. Memcached, a distributed memory object caching system has clients that support consistent hashing. Systems like Chord, which is a distributed hash table implementation, and Amazon’s Dynamo, which is a key-value store also take advantage of consistent hashing.

Software Engineer @Grab | Avid learner| Curious Mind

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store