I broke my head over the Raft Consensus Algorithm so that you don't have to
What is the Raft Consensus Algorithm in Distributed Systems — CMU 15-440/640 P2
This was tough, very tough, but all worth it in the end.
To give you a little context, I am enrolled in the Distributed Systems class at Carnegie Mellon University, where we learnt about Raft, a consensus algorithm that sits at the very core of distributed systems as one of its most beloved algorithms.
As part of the coursework, one of our individual projects is to implement the Raft consensus algorithm. It is split into two parts, and while the first part seems simple enough, it is in the second part that the game actually starts.
After a week of struggle, in which I endured just 6 hours of sleep across 3 days and pulled one all-nighter, I was able to complete the project. While I am at fault for missing minute details and spending hours debugging, this blog serves exactly that purpose: a free checklist of all the areas you should keep an eye on, so they don't snowball into a headache before your deadline.
So sit back and get your cup of tea or coffee ready for this endeavour.
Understanding the algorithm
The post assumes you know about Raft, because it dives headfirst into it; if you don't, then please go through the following links.
- First, read the Raft paper
- Then, visualize Raft
- Additionally, if you would like to learn more, click this link
Now let's get on with it.
The Heartbeat of Raft
The heartbeats generated by the leader in Raft are the equivalent of oxygen in the system. They perform various functions, and it is vital that your heartbeat code handles all of these responsibilities:
- Notifying followers of a new leader
- Maintaining leadership
- Determining Log Inconsistencies
- Determining term Inconsistencies
- Propagating the updated commit index to followers.
The heartbeats from the leader are sent using the AppendEntries RPC; as per the Raft paper, the following fields are sent.
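For reference, here is a sketch of those fields as Go structs. The names are my own rendering of Figure 2 of the paper, not the course starter code:

```go
// LogEntry is a single entry in the replicated log.
type LogEntry struct {
	Term    int         // term in which the entry was created on the leader
	Command interface{} // state machine command carried by the entry
}

// AppendEntriesArgs mirrors the AppendEntries arguments in Figure 2
// of the Raft paper; a heartbeat is simply a call with no Entries.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for a heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesReply mirrors the results in Figure 2.
type AppendEntriesReply struct {
	Term    int  // receiver's currentTerm, so the leader can update itself
	Success bool // true if the PrevLogIndex/PrevLogTerm check passed
}
```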
Notifying followers about a new leader
When a leader is elected for the first time, it is responsible for sending heartbeats at periodic intervals to all the followers. This lets the other followers know that a leader has been elected and that their election timeouts have to be reset.
The image below illustrates how the heartbeat lets the other followers know a leader has been elected.
Maintaining leadership
Similarly, the leader must keep sending periodic heartbeats to all the peers; if a peer does not receive a heartbeat within its election timeout window, it initiates an election to take the leader position.
When a follower receives a heartbeat, it must reset its election timeout regardless of whether its log is consistent or not.
The only time the election timeout is not reset is when the receiving peer has a higher term, which indicates it has promoted itself to a candidate. The example below illustrates this.
Initial state
A (Leader): Term 5
B, C, D, E (Followers): Term 5
Network partition occurs
Partition 1 (minority): A, B
Partition 2 (majority): C, D, E
In Partition 2
C's election timeout triggers.
C increases its term to 6, becomes a candidate, and requests votes.
C gets a majority (3 out of 5 total votes) and becomes leader for term 6.
Meanwhile, in Partition 1
A thinks it's the leader and sends heartbeats to B
A and B still remain in term 5, not aware of the new leader C
Network partition reconnects, all nodes can now communicate
C sends heartbeat with term 6 to all nodes.
A sees the higher term number and realizes it's no longer the leader.
A updates term to 6 and becomes follower.
B similarly follows.
The network partition is just one example. It can also happen that the old leader demotes itself to a follower and, for a moment, there is no leader in the system: the old leader then increments its term, starts an election, and wins it, since it has the most up-to-date log. How a follower with a higher term is dealt with therefore depends on the state of the system.
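In code, every RPC handler and every reply path needs this step-down check. Here is a minimal sketch; the Raft struct and its field names are my own hypothetical layout (reused in the rest of the sketches in this post), not the course handout's:

```go
import "sync"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Raft is a hypothetical per-node state container used by the
// sketches in this post.
type Raft struct {
	mu          sync.Mutex
	me          int // this node's index in the peer list
	currentTerm int
	votedFor    int // -1 when no vote has been cast this term
	state       State
	log         []LogEntry // index 0 holds a dummy entry (see the last section)
	commitIndex int
	nextIndex   []int // leader only: next entry to send to each peer
	matchIndex  []int // leader only: highest entry known replicated on each peer
}

// stepDownIfStale demotes this node to follower whenever it sees a
// term higher than its own, no matter what role it currently holds.
// Caller must hold rf.mu.
func (rf *Raft) stepDownIfStale(term int) {
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.state = Follower
		rf.votedFor = -1 // no vote cast yet in the new term
	}
}
```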
Determining log index inconsistencies
The heartbeats carry a parameter, prevLogIndex, sent by the leader. When a follower's log is shorter than prevLogIndex, the failed reply lets the leader know that this follower's log does not have all the entries and needs to be updated with new entries.
We can understand this with a simple example.
Initial state
A: [1, 2, 3, 4, 5]
B: [1, 2, 3, 4, 5]
C: [1, 2, 3]
Leader (A) sends a heartbeat to the followers, including prevLogIndex = 5
B receives the heartbeat:
It compares the log index and term, everything matches, and it replies success
C receives the heartbeat:
It compares its log with the leader's information
C's log is missing entries 4 and 5
C replies failure
A receives C's response:
Recognizes the inconsistency
Decrements nextIndex for C down to 4
A sends another AppendEntries to C with entries starting from index 4
This process continues until C's log is consistent with A's.
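A sketch of the leader-side reaction to those replies, continuing with the hypothetical Raft struct from above (the RPC plumbing and retry loop are left out):

```go
// handleAppendEntriesReply processes one follower's reply to an
// AppendEntries RPC. Caller must hold rf.mu.
func (rf *Raft) handleAppendEntriesReply(peer int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.stepDownIfStale(reply.Term)
	if rf.state != Leader || args.Term != rf.currentTerm {
		return // stale reply from an earlier term; ignore it
	}
	if reply.Success {
		// The follower accepted everything up to the end of Entries.
		rf.matchIndex[peer] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[peer] = rf.matchIndex[peer] + 1
		// Re-check the commit index here (sketched in a later section).
	} else {
		// Consistency check failed: back up one entry and let the next
		// heartbeat carry entries starting from the new nextIndex.
		if rf.nextIndex[peer] > 1 {
			rf.nextIndex[peer]--
		}
	}
}
```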
Determining term inconsistencies in the log
Each heartbeat contains a term, prevLogTerm, which the follower uses to compare the term of its own entry at prevLogIndex against the leader's.
If there is a mismatch, it indicates that the follower holds wrong log entries from some previous term, which must be replaced.
The example below illustrates this.
Initial state (Term 5)
A (Leader): [1, 2, 3, 4, 5]
B, C, D, E (Followers): [1, 2, 3, 4, 5]
Network partition occurs
Partition 1 (minority): A, B
Partition 2 (majority): C, D, E
In Partition 2
C new leader for term 6
C adds two new entries: [1, 2, 3, 4, 5, 6(t6), 7(t6)]
D and E replicate these entries
In Partition 1:
A adds two entries at the same time: [1, 2, 3, 4, 5, 6(t5), 7(t5)]
B replicates these entries
The partitions now shift: node E is disconnected from the majority partition, added to the old minority partition, and all nodes within each new partition are reconnected.
State of the partitions:
Partition 1 (old minority): A, B, E
Partition 2 (old majority): C, D
A new election occurs, A becomes leader in partition 1 again, and E replicates A's entries.
Now the term of leader A is also 6, but the last term in its log is still 5.
Current state before the partition heals:
A, B, E: [1, 2, 3, 4, 5, 6(t5), 7(t5)] // lower last log term
C, D: [1, 2, 3, 4, 5, 6(t6), 7(t6)] // higher last log term
Network partition heals:
C sends heartbeats to all nodes
A, B and E receive AppendEntries with prevLogTerm 6
Log inconsistency resolution:
A, B and E find that their entry at index 7 has the wrong term (5 instead of 6)
They reply failure, informing C of the inconsistency
C decrements nextIndex for those nodes and sends AppendEntries again, including entries [6(t6), 7(t6)] to replace the incorrect entries at the end
A, B and E update their logs, replacing their last two entries with the correct ones from C
Final state (all nodes):
[1, 2, 3, 4, 5, 6(t6), 7(t6)]
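Both the index check and the term check live in the follower's AppendEntries handler. A sketch, continuing with the hypothetical struct from earlier (resetElectionTimeout() is a hypothetical helper, not a real course API):

```go
// AppendEntries is the follower-side handler; heartbeats arrive here too.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	reply.Term = rf.currentTerm
	if args.Term < rf.currentTerm {
		reply.Success = false // stale leader: reject and do NOT reset the timeout
		return
	}
	rf.stepDownIfStale(args.Term)
	rf.resetElectionTimeout() // valid leader: always reset, even if the log check fails

	// Reject if our log is too short, or the entry at PrevLogIndex has
	// a different term than the leader expects.
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}

	// Drop the conflicting suffix and append the leader's entries.
	// NB: production code should truncate only on a real conflict, or a
	// stale RPC arriving late could erase newer entries.
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)

	// (The leaderCommit handling from the next section goes here.)
	reply.Success = true
}
```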
Propagating the updated commit index to followers.
When a leader adds a new entry, propagates it to the followers, receives success replies, and updates its commit index, it must propagate that new commit index to the followers with every heartbeat.
This is done using the leaderCommit field in the heartbeats.
Updating the Commit Index
Leader
This is equally important for the leader. The logic to update the commit index on the leader is as follows:
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm, then set commitIndex = N.
I had coded this logic wrong, which led to my leader not advancing commitIndex and waiting for two further commits to realise that values had been committed. This led to the first entry being missed while the commitIndex was propagated and advanced for all the following entries.
Let's consider a Raft cluster with 5 nodes.
The current state is:
currentTerm: 3
commitIndex: 5
Leader's log (A): [1(t1), 2(t1), 3(t2), 4(t2), 5(t2), 6(t3), 7(t3), 8(t3)]
matchIndex for followers: [5(B), 7(C), 6(D), 4(E)]
The leader checks for the highest N that satisfies all of the conditions:
N must be greater than commitIndex (5)
N must be replicated on a majority - at least 3 out of the 5 nodes
The log entry at index N must be from the current term (3)
One thing we have to consider is that the leader itself already has the entry at this index, so it counts toward the majority.
The leader examines potential N values:
N = 6:
6 > 5 (commitIndex) - satisfied
3 nodes have the entry: A itself, plus C and D with matchIndex ≥ 6 - satisfied
log[6].term == 3 (currentTerm) - satisfied
N = 7:
7 > 5 (commitIndex) - satisfied
Only 2 nodes have the entry: A itself and C with matchIndex ≥ 7 - not a majority
The leader determines that N = 6 is the highest value satisfying all conditions.
The leader updates its commitIndex to 6.
This logic must be checked on the leader every time a reply is received from a peer. Other logic, such as counting the frequency of success replies for each log entry, does not work.
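One straightforward way to implement that rule with the hypothetical struct from earlier, scanning down from the end of the log:

```go
// updateCommitIndex advances the leader's commitIndex to the highest N
// that is replicated on a majority and belongs to the current term.
// Run it after every successful AppendEntries reply; caller holds rf.mu.
func (rf *Raft) updateCommitIndex() {
	for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
		if rf.log[n].Term != rf.currentTerm {
			break // entries from older terms are never counted directly
		}
		count := 1 // the leader itself already holds index n
		for peer, match := range rf.matchIndex {
			if peer != rf.me && match >= n {
				count++
			}
		}
		if count*2 > len(rf.matchIndex) { // strict majority of the cluster
			rf.commitIndex = n
			return
		}
	}
}
```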
Follower
When a follower receives a higher commit index in a heartbeat, it must make the call whether to update its own commit index. The logic for it is as follows:
If leaderCommit > commitIndex then set commitIndex = min(leaderCommit, index of last new entry)
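In the AppendEntries handler sketched earlier, this is the piece that was left as a placeholder (min is Go 1.21's built-in):

```go
// Inside the AppendEntries handler, after appending the new entries:
if args.LeaderCommit > rf.commitIndex {
	// Never move past our own last entry, which may still lag the leader.
	rf.commitIndex = min(args.LeaderCommit, len(rf.log)-1)
}
```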
Logic of Handling Election
When a peer receives a RequestVote RPC, it has to evaluate a number of conditions to determine whether it should grant its vote or not.
Peer has a more up-to-date log
There are a couple of situations here that can make the peer reject the candidate; both are covered by the sketch after these examples.
- The peer has the same last log term as the candidate but a higher last log index
For example:
Candidate's log: [1(t1), 2(t2), 3(t3), 4(t3)]
Peer's log: [1(t1), 2(t2), 3(t3), 4(t3), 5(t3)]
The candidate and the peer have the same last log term (3), but the peer has a higher last log index
- The peer has a higher last log term than the candidate's last log term
For example:
Candidate's log: [1(t1), 2(t2), 3(t3), 4(t3)]
Peer's log: [1(t1), 2(t2), 3(t3), 4(t4)]
The peer has a higher term in its last log entry even though the last index is the same
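Both rejection cases collapse into the paper's single "at least as up-to-date" comparison. A sketch, with hypothetical names for the RequestVote arguments:

```go
// RequestVoteArgs mirrors the RequestVote arguments from the paper.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as ours: a higher last term wins outright; equal last terms
// fall back to comparing log length. Caller must hold rf.mu.
func (rf *Raft) candidateUpToDate(args *RequestVoteArgs) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if args.LastLogTerm != myLastTerm {
		return args.LastLogTerm > myLastTerm
	}
	return args.LastLogIndex >= myLastIndex
}
```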
Peer has a higher term
If a peer has a higher term than the candidate requesting the vote, the vote is rejected without any other check.
The candidate must demote itself to a follower immediately and wait until an election timeout fires on itself or some other node.
Starting Index of Logs
Raft starts its log index at 1, and the code becomes much easier if this is followed in the implementation. Unlike arrays, which start at index 0, Raft logs start at index 1, so inserting a dummy (nil) entry at index 0 forces the real entries to start at log index 1 and makes the code so much more convenient.
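Concretely, with the hypothetical struct from earlier, the constructor just seeds the log:

```go
// newRaft seeds the log with a dummy entry at index 0 so that the first
// real entry lands at index 1, matching the paper's numbering: len(rf.log)-1
// is then always the last log index, with no special-casing anywhere.
func newRaft(me, nPeers int) *Raft {
	return &Raft{
		me:         me,
		votedFor:   -1,
		log:        []LogEntry{{Term: 0}}, // dummy entry; Command stays nil
		nextIndex:  make([]int, nPeers),
		matchIndex: make([]int, nPeers),
	}
}
```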
I did not do this, and it caused a lot of code changes later; since everything depends on log indices, it was a bloodbath to get everything working again after this huge change.
Conclusion
Compared to other projects, the Raft algorithm is definitely one of the most challenging. However, it is also one of the coolest projects I have done. This free set of tips is something I wish I had when I started, to keep me from writing the wrong solution.
The Raft algorithm finds its value in many places, from etcd in Kubernetes to Kafka, and is definitely a worthwhile algorithm to know, one that can be adapted and personalized for large-scale distributed systems.
If you found these quick tips and instructions useful, then please consider subscribing and clapping for the post.