Distributed systems are becoming increasingly popular today, with the world’s ever-growing technological expansion. It is a large and complicated area of computer science research.
One of the biggest problems of Distributed Systems is reaching consensus among the distant nodes. In this article, we will discuss in detail Raft which is one of most popular and modern consensus algorithms.
What is distributed consensus?
Before diving into Raft, lets begin by understanding what distributed consensus means.
In distributed computing systems, software applications are distributed across multiple nodes and work together to appear as a single coherent unit to the end user. These nodes coordinate with each other and share information to reach a decision whenever required. Distributed consensus is a fundamental requirement for constructing fault-tolerant, strongly-consistent distributed systems.
Two major consensus algorithms have dominated the production system: Paxos, the early, famously subtle, algorithm; and Raft, which was developed as a more understandable alternative to Paxos. Raft aimed to solve the problem of consensus by decomposing it into multiple subproblems and solving them.
What is Raft?
The name Raft is derived from Replicated And Fault-Tolerant. Raft is a more modern distributed consensus algorithm which aims to be highly reliable and easy to understand. It solves the problem of multiple servers working together to agree upon a decision by sharing information with each other, even in the event of a failure. It is used by many distributed software solutions such as Vault, Consul, RabbitMQ, etc.
Raft works by electing a leader in the cluster of nodes. Other nodes follow the instructions of the leader. The leader is responsible for accepting requests from client and ensuring the logs(containing request command) are replicated to all other servers. Only when a log gets replicated to majority of the server, it can be executed.
Raft decomposes the problem of reaching consensus into three subproblems:
- Leader Election: A leader must exist at all times within the system. In case of any failure of the current one, a new leader must be chosen from the remaining nodes.
- Log Replication: The leader needs to ensure all the logs(requests) it has received from clients are replicated at all other server nodes.
- Safety: Once a server has committed a log at an index, no other server can commit a different log at the same index.
Before diving deep into these subproblems, we need to be aware of some fundamentals and nomenclature that govern the Raft algorithm.
Fundamentals and Nomenclature
Each server in Raft exists in one of the three states: Follower, Candidate or Leader.
At all times, there must exist only one leader within the system. Followers are passive entities, they only respond to requests from leader or candidates and never raise any requests of their own. When an existing leader fails, one or more server nodes transition to candidate state. A candidate is a potential future leader. Leader receives requests from clients and ensure the request log is replicated at all other servers.
In Raft, the unit of time is called a term. Terms are of arbitrary length and do not have any global value. Each server maintains its own current term value. Terms are represented as numbers which monotonically increase over time and are used for inter-process communication between the nodes. Every term begins with an election and after a successful election, a single leader manages the cluster till the end of term. In case election fails, a term may end up without a leader.
Raft uses of two remote procedure calls (RPCs) to carry out its basic operation:
- RequestVotes: used by candidates during election
- AppendEntries: used by leaders for sending periodic heartbeats(contains no log entry) and replicating logs at followers.
Logs in Raft are used by leader for replicating the request it receives from client at other server. Each log has a term number, a command and a log index associated with them.
The term of a log is the term at which the leader had published the log. Command is the request that client had raised to the leader. Log index is the position of the log in the log queue. For e.g: in the image above, the log in 1st row and 4th index has term number 2(marked by yellow) and command move.
Logs are always sorted in ascending order of <term number, index number>. Term number takes precedence, which means that a log with lower term must appear before a log with higher term in the queue.
A log is said to be committed if majority of the servers have replicated that log. Also, every log entry before a committed log are also considered committed. In the image above, logs till index 7 are replicated on 3 out of 5 server and hence are said to be committed. Only committed logs are safe to be executed.
Now that we have understanding of the fundamentals, lets dive into the three subproblems and their solution in Raft.
1. Leader Election
In Raft, at all times there must remain one and only one leader of the cluster. All the follower nodes have a random timeout(between ~150–300ms) during which they expect a heartbeat from the leader. Leaders send AppendEntries RPC with empty log entry as a heartbeat to all its followers.
Once followers receive the heartbeat, they immediately reset their timeout. However, if no heartbeat is received by the timeout interval, followers assume that the current leader has crashed and they increment their term number and change to a candidate.
As shown above, the cluster starts with no leader initially(all orange nodes). Due to random timeouts at nodes, one of the node will timeout first, which in this case is node S3. Upon timeout, S3 increments its term number, changes to a candidate and sends RequestVote RPC to all other nodes. S3 receives votes from all other nodes and turns into a leader(turns blue with no timeout). After becoming leader, s3 starts to send heartbeats to other nodes. When other nodes receive heartbeats from s3, they become followers immediately(turn blue with timeout).
In case more than one node timeouts at the same moment and turn into candidates at the same time, an equal split of votes might occur. This is known as a split vote situation. In this case, each candidate will timeout and a new election will be triggered.
Another possible scenario is that while s3 is waiting for votes from other nodes, it receives an AppendEntries RPC from another node claiming to be the leader. In that case, s3 will compare its current term number with the term number in the the RPC received. If it has a lower term number, it will give up and accept the claiming node as the leader.
2. Log Replication
Once a leader is chosen, it is responsible for receiving requests from the clients and replicating the log(carrying the request command) at other nodes. Clients can only communicate with the leader of the cluster at any moment. The leader publishes commands received from the client in the form of logs to all its followers using AppendEntries RPC.
Every log contains a term number , index number and the command. Upon receiving a log from leader, the followers append the log to their own local log queue and send an ACK back to the leader. If a follower has crashed or is slow to respond, the leader keeps retrying in subsequent messages until the follower responds.
Once a log is replicated at majority of the nodes, it is considered committed and safe to execute. For e.g: If out of 5 servers, the log is replicated successfully at 3 servers, it is considered committed and safe. The leader executes the command of committed log at its machine and returns the result to the client.The leader then notifies the followers (using subsequent AppendEntries RPC) that they can commit the message in their state machines as well.
If one or more of the follower nodes are not available or slow to respond, it can lead to inconsistencies in the log queue. It is the responsibility of the leader to solve any log inconsistencies. In order to do so, it uses the following defined Log Matching Properties :
- If log entries on different servers have the same index and term, then:
- They store the same command
- All preceding logs are identical as well
2. If a given entry is committed, all preceding entries are also committed.
In order to maintain the Log Matching Property defined above, the leader also includes the term and the log index of the previous log entry. When a follower receives an AppendEntries RPC with log entry, it first checks if the
<index, term> of the previous log entry of leader exists in its own logs. If not, it rejects the log and the leader retries with a lower log index.
In order to ensure that the above leader election and log replication work under all conditions, certain safety conditions needs to be added to the Raft algorithm.
- During election, a candidate’s log is at least as up-to-date as the other nodes. If not the other nodes will not vote for the candidate.
- Candidate’s while sending RequestVote RPC during election, also includes information about the last committed log by the candidate.
- While voting, the voter node checks if the candidate’s last committed log is as updated as its own or not. If not then it declines the vote to the candidate. This ensures all future leaders will always have all committed entries.
Raft, in its goal to have a much easier understandability, does stands tall. Paxos might have been the one of the first algorithm to solve the problem of distributed consensus, but its very complicated to understand. Due it its simplicity, Raft has been adopted by many software applications such as Consul, Vault, etc. as mentioned earlier as well. Raft’s implementation can be found in many programming languages such as C++, Java, Golang, etc. here. For more details on Raft, refer to the references below.
- Raft In Search of an Understandable Consensus Algorithm — An extended version of the original Raft paper by Diego Ongaro and John Ousterhout
- An introduction to the Raft Distributed Consensus Algorithm
- Designing for Understandability: the Raft Consensus Algorithm