This post offers an overview of the Raft algorithm used for implementing distributed consensus. Within the post, we’ll look at what consensus and distributed consensus means, why Raft was developed and give an overview of how it works, along with links to other resources.
An Overview of Consensus
We rely on systems that rely on consensus for our everyday computing. Websites like Twitter and Facebook demand a meaningful solution to consensus to be able to provide distributed solutions to execute the distributed architectures they employ for high availability. However, the conundrum is this: one of the primary areas of practical research in distributed computing is to determine if consensus is indeed possible. In 1985, researchers Fischer, Lynch and Patterson (FLP) laid to rest this long-standing question, determining that consensus in an asynchronous system was indeed impossible.
Nonetheless, practical research into consensus continues, defining consensus as “the assurance of the following three properties in a distributed system:
- Validity: any value decided upon must be proposed by one of the processes
- Agreement: all non-faulty processes must agree on the same value,
- Termination: all non-faulty nodes eventually decide”.
What is Distributed Consensus?
Imagine we have a single node system. For this example, consider the node as a database server, which stores a single value. There is also a client that is able to send a value to the server. With one node, coming to agreement, or consensus, is straightforward. How is consensus reached, however, if there are multiple nodes? This is the problem of distributed consensus. The Secret Lives of Data site explains the following concepts through a series of easy-to-understand visualizations.
What is Raft?
Raft is a protocol for implementing distributed consensus, designed as an alternative to Paxos. It was intended to be more straightforward to understand than Paxos through separation of logic. It is also proven to be formally safe and offers some additional features.
Raft provides a generic way to distribute a state machine across a cluster of computing systems, guaranteeing that each node in the cluster agrees upon the same set of state transitions. It has several open-source reference implementations, with full-specification implementations available in Go, C++, Java and Scala.
In the Raft algorithm, a node can exist in 1 of 3 states:
- The Follower state: the follower is passive – they cannot issue their own request, but only respond to requests from candidates and leaders;
- The Candidate state: this state is used to elect a new leader;
- The Leader state: the leader takes care of all client requests (if a client sends a message to a follower, the follower sends it on to the leader).
Raft decomposes consensus into three relatively independent sub-problems:
All nodes begin in the follower state. If the followers don’t hear from a leader, they can become a candidate. As a candidate, the node then requests votes from other nodes. Nodes reply directly with their vote. If the candidate node gets a majority of the votes of the nodes, it becomes the leader. This process is known as Leader Election.
There are two timeout settings in Raft that control elections:
- The Election Timeout e. the amount of time a follower waits before becoming a candidate (this is randomized to be between 150ms and 300ms). Following the election timeout, the follower turns into a candidate and begins a new election term, votes for itself and initiates Request Vote messages, sending them out to other nodes. If the node receiving the message hasn’t already voted, it votes for the candidate and the node is able to reset its election timeout. Once the candidate receives the majority of the votes, it becomes the leader and then sends out Append Entries messages to its followers.
- The Heartbeat Timeout is the amount of time specified for when the Append Entries messages are sent out. Followers respond to each Append Entries The election term continues until a follower ceases to receive heartbeats and becomes a candidate itself. As a majority of votes is always required for a follower to become a leader, only one leader can be elected each term.
If two nodes simultaneously become candidates, then a split vote may take place. Two nodes begin an election for the same term, and each reaches a single follower node before the other one. Each candidate can then receive no more votes for this term. They have to wait for the next election when they can try again to receive a majority of votes and become the leader.
The term number increases monotonically. Each server stores the current term number. This is also exchanged during every communication. If one server has a current term that is smaller than another, it will update its current term to the larger value. If a candidate or leader determines that its term is out of date, it will immediately revert to follower state. If a server receives a request with a stale term number, it will reject the request.
Going forwards, all changes to the system now move through the leader. The data flows in one direction only: from the leader to the followers. All changes need to be replicated across the system to all nodes. This happens by way of the same Append Entries messages that was used for the heartbeat timeout.
A client first sends a change to the leader. The change is added to the log of the leader, then it is sent on to the followers on the next heartbeat.
Every change that occurs is noted as an entry in the node’s log. The log entry is uncommitted, meaning it won’t update the node’s value (e.g. 5) until the node replicates it to the follower nodes. An entry is only committed once a majority of the followers acknowledge it: the leader has to wait until the majority of nodes have written the entry. Following this, the entry is committed on the leader log making the node state “5” (in this instance). The leader then lets the followers know that the entry has been committed by sending them a response. This results in consensus among the cluster about the system state. This process is known as Log Replication.
Raft has the ability to stay consistent in the face of network partitions. Through partitioning the network, we can have two leaders in different terms. The network partition can also be healed. The node in an uncommitted state will see the node with the higher election term and step down to allow them to be leader. Both nodes roll back their uncommitted entries in order to match the new leader’s log, making the log consistent across the cluster.
If one server has committed a log entry at a particular index, no other server can apply a different log entry for that index. This is important to guarantee that all logs are consistent and the state machines execute the same set of commands.
Raft offers a guarantee for each of these safety components:
- Election Safety: only one leader can be elected per term;
- Leader Append-Only: only a leader can append new entries to its logs (it can’t overwrite or delete entries);
- Log Matching: if two logs contain an entry both share the same index and term, then the logs remain identical in all entries up through the given index;
- Leader Completeness: if a log entry is committed in a given term then it will also be present in the logs of the leaders for the duration of the term;
- State Machine Safety: if a server has already applied a specific log entry to its state machine, then no other server can apply a different command for that same log.
The four first rules are guaranteed by the details of the Log Replication algorithm. The State Machine Safety is guaranteed by a restriction on the election process.
Raft deploys a two-phase approach for changing cluster membership. The first phase involves it switching to an intermediate configuration called joint consensus. As soon as that is committed, it switches to the new configuration. Joint consensus combines the new and old configurations, allowing individual servers to migrate between configurations at different times without compromising safety. Joint consensus also allows the cluster to continue to service client requests across the configuration change.
When a leader receives a message related to configuration change, it will store and replicate the entry for join consensus C<old, new>. A server will always use the most up-to-date configuration in its log to make decisions even when it is not committed. Once joint consensus is committed, only servers with C<old, new> in their logs can become leaders.
Why is Consensus and Raft Important?
As modern computing infrastructure becomes increasingly distributed and more complicated, the ever-increasing number of users are making higher demands for reliability, latency, security and consistency. Consensus algorithms enable a collection of machines to work as a coherent group, which can outlast the failure of some of its members. This means that they play a critical role in building reliable large-scale software systems. Paxos has been the popular consensual algorithm for the last decade, and is used as the main vehicle to teach students about consensus. Diego Ongaro while a PhD student in Computer Science at Stanford University, worked with Professor John Ousterhout on Raft as a distributed consensus algorithm that was designed to be easier to understand than Paxos.
In their co-authored paper, In Search of an Understandable Consensus Algorithm, the two explained why they set out to develop Raft:
“After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is signifi- cantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works. The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos”.
In the paper’s conclusion, the authors discuss why understandability is so important for an algorithm. As they state, none of an algorithm’s other goals (typically including correctness, efficiency and conciseness) can be achieved if the algorithm itself is not understood. The developers need to be able to render the algorithm into a practical form, which “will inevitably deviate from and expand upon the published form”. In order to do this, the developers need to understand the algorithm on a deep enough level to create intuitions about it of their own, and decide what features they will retain as they go on to implement it. The success of Raft speaks for itself. There are many implementations of Raft available in various stages of development.
The RAFT GitHub page offers a wealth of links to these implementations, other publications and talks on Raft. Some of the key links are:
- The “Raft Paper”, In Search of an Understandable Consensus Algorithm (Extended Version)by Diego Ongaro and John Ousterhout
- The PhD dissertation by Diego Ongaro – expands on the content of the paper in much more detail, and it includes a simpler cluster membership change algorithm.
- Talk on Raft by Diego Ongaro at BuildStuff 2015
- Talk on Raft by John Ousterhout, August 2016 at CS@Illinois Distinguished Lecture Series
The page offers many more links to further publications and talks, along with a detailed list of courses teaching Raft and the different implementations available.