I broke my head over the Raft Consensus Algorithm so that you don't have to
What is the Raft Consensus Algorithm in Distributed Systems — CMU 15-440/640 P2
This was tough, very tough, but all worth it in the end.
To give you a little context, I am enrolled in the Distributed Systems class at Carnegie Mellon University, where we learnt about Raft, a consensus algorithm that sits at the very core of distributed systems as one of its most beloved algorithms.
As part of the coursework, one of our individual projects is to implement the Raft consensus algorithm. It is split into two parts, and while the first part seems simple enough, it is in the second part that the game actually starts.
After a week of struggle, in which I endured just 6 hours of sleep across 3 days and pulled one all-nighter, I was able to complete the project. While I am at fault for missing minute details and spending hours debugging, this blog serves exactly that purpose: a free checklist of all the areas you should keep an eye on, so they don't snowball into a headache before your deadline.
So sit back and get your cup of tea or coffee ready for this endeavour.
Understanding the algorithm
The post assumes you know about Raft, because it dives headfirst into it; if you don't, then please go through the following links.
- First, read the Raft paper
- Then, visualize Raft
- Additionally, if you would like to learn more, click this link
Now let's get on with it.
The Heartbeat of Raft
The heartbeats generated by the leader in Raft are the equivalent of oxygen in the system. They perform various functions, and it is vital that your heartbeat code handles all of these responsibilities:
- Notifying followers of a new leader
- Maintaining leadership
- Determining Log Inconsistencies
- Determining term Inconsistencies
- Propagating the updated commit index to followers.
The heartbeats from the leader are sent using the AppendEntries RPC; as per the Raft paper, the following fields are sent.
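For reference, here is a sketch of those fields as Go structs. The names are my own rendering of Figure 2 of the paper, not the course starter code:

```go
// LogEntry is a single entry in the replicated log.
type LogEntry struct {
	Term    int         // term in which the entry was created on the leader
	Command interface{} // state machine command carried by the entry
}

// AppendEntriesArgs mirrors the AppendEntries arguments in Figure 2
// of the Raft paper; a heartbeat is simply a call with no Entries.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for a heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesReply mirrors the results in Figure 2.
type AppendEntriesReply struct {
	Term    int  // receiver's currentTerm, so the leader can update itself
	Success bool // true if the PrevLogIndex/PrevLogTerm check passed
}
```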
Notifying followers about a new leader
When a leader is elected for the first time, it is responsible for sending heartbeats at periodic intervals to all the followers. This lets the other followers know that a leader has been elected and that their election timeouts have to be reset.
The image below illustrates how the heartbeat lets the other followers know a leader has been elected.
Maintaining leadership
Similarly, the leader must keep sending periodic heartbeats to all the peers; if a peer does not receive a heartbeat within its election timeout window, it initiates an election to take the leader position.
When a follower receives a heartbeat, it must reset its election timeout regardless of whether its log is consistent or not.
The only time the election timeout is not reset is when the receiving peer has a higher term, which indicates it has promoted itself to a candidate. The example below illustrates this.
Initial state
A (Leader): Term 5
B, C, D, E (Followers): Term 5
Network partition occurs
Partition 1 (minority): A, B
Partition 2 (majority): C, D, E
In Partition 2
C's election timeout triggers.
C increases its term to 6, becomes a candidate, and requests votes.
C gets a majority (3 out of 5 total votes) and becomes leader for term 6.
Meanwhile, in Partition 1
A thinks it's the leader and sends heartbeats to B
A and B still remain in term 5, not aware of the new leader C
Network partition reconnects, all nodes can now communicate
C sends heartbeat with term 6 to all nodes.
A sees the higher term number and realizes it's no longer the leader.
A updates term to 6 and becomes follower.
B similarly follows.
The network partition is just one example. It can also happen that the old leader demotes itself to a follower and, for a moment, there is no leader in the system: the old leader then increments its term, starts an election, and wins it, since it has the most up-to-date log. How a follower with a higher term is dealt with therefore depends on the state of the system.
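In code, every RPC handler and every reply path needs this step-down check. Here is a minimal sketch; the Raft struct and its field names are my own hypothetical layout (reused in the rest of the sketches in this post), not the course handout's:

```go
import "sync"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Raft is a hypothetical per-node state container used by the
// sketches in this post.
type Raft struct {
	mu          sync.Mutex
	me          int // this node's index in the peer list
	currentTerm int
	votedFor    int // -1 when no vote has been cast this term
	state       State
	log         []LogEntry // index 0 holds a dummy entry (see the last section)
	commitIndex int
	nextIndex   []int // leader only: next entry to send to each peer
	matchIndex  []int // leader only: highest entry known replicated on each peer
}

// stepDownIfStale demotes this node to follower whenever it sees a
// term higher than its own, no matter what role it currently holds.
// Caller must hold rf.mu.
func (rf *Raft) stepDownIfStale(term int) {
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.state = Follower
		rf.votedFor = -1 // no vote cast yet in the new term
	}
}
```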
Determining log index inconsistencies
The heartbeats carry a parameter, prevLogIndex, sent by the leader. When a follower's log is shorter than prevLogIndex, the failed reply lets the leader know that this follower's log does not have all the entries and needs to be updated with new entries.
We can understand this with a simple example.
Initial state
A: [1, 2, 3, 4, 5]
B: [1, 2, 3, 4, 5]
C: [1, 2, 3]
Leader (A) sends a heartbeat to the followers, including prevLogIndex = 5
B receives the heartbeat:
It compares the log index and term, everything matches, and it replies success
C receives the heartbeat:
It compares its log with the leader's information
C's log is missing entries 4 and 5
C replies failure
A receives C's response:
Recognizes the inconsistency
Decrements nextIndex for C down to 4
A sends another AppendEntries to C with entries starting from index 4
This process continues until C's log is consistent with A's.
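A sketch of the leader-side reaction to those replies, continuing with the hypothetical Raft struct from above (the RPC plumbing and retry loop are left out):

```go
// handleAppendEntriesReply processes one follower's reply to an
// AppendEntries RPC. Caller must hold rf.mu.
func (rf *Raft) handleAppendEntriesReply(peer int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.stepDownIfStale(reply.Term)
	if rf.state != Leader || args.Term != rf.currentTerm {
		return // stale reply from an earlier term; ignore it
	}
	if reply.Success {
		// The follower accepted everything up to the end of Entries.
		rf.matchIndex[peer] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[peer] = rf.matchIndex[peer] + 1
		// Re-check the commit index here (sketched in a later section).
	} else {
		// Consistency check failed: back up one entry and let the next
		// heartbeat carry entries starting from the new nextIndex.
		if rf.nextIndex[peer] > 1 {
			rf.nextIndex[peer]--
		}
	}
}
```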
Determining term inconsistencies in the log
Each heartbeat contains a term, prevLogTerm, which the follower uses to compare the term of its own entry at prevLogIndex against the leader's.
If there is a mismatch, it indicates that the follower holds wrong log entries from some previous term, which must be replaced.
The example below illustrates this.
Initial state (Term 5)
A (Leader): [1, 2, 3, 4, 5]
B, C, D, E (Followers): [1, 2, 3, 4, 5]
Network partition occurs
Partition 1 (minority): A, B
Partition 2 (majority): C, D, E
In Partition 2
C new leader for term 6
C adds two new entries: [1, 2, 3, 4, 5, 6(t6), 7(t6)]
D and E replicate these entries
In Partition 1:
A adds two entries at the same time: [1, 2, 3, 4, 5, 6(t5), 7(t5)]
B replicates these entries
The partitions now shift: node E is disconnected from the majority partition, added to the old minority partition, and all nodes within each new partition are reconnected.
State of the partitions:
Partition 1 (old minority): A, B, E
Partition 2 (old majority): C, D
A new election occurs, A becomes leader in partition 1 again, and E replicates A's entries.
Now the term of leader A is also 6, but the last term in its log is still 5.
Current state before the partition heals:
A, B, E: [1, 2, 3, 4, 5, 6(t5), 7(t5)] // lower last log term
C, D: [1, 2, 3, 4, 5, 6(t6), 7(t6)] // higher last log term
Network partition heals:
C sends heartbeats to all nodes
A, B and E receive AppendEntries with prevLogTerm 6
Log inconsistency resolution:
A, B and E find that their entry at index 7 has the wrong term (5 instead of 6)
They reply failure, informing C of the inconsistency
C decrements nextIndex for those nodes and sends AppendEntries again, including entries [6(t6), 7(t6)] to replace the incorrect entries at the end
A, B and E update their logs, replacing their last two entries with the correct ones from C
Final state (all nodes):
[1, 2, 3, 4, 5, 6(t6), 7(t6)]
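Both the index check and the term check live in the follower's AppendEntries handler. A sketch, continuing with the hypothetical struct from earlier (resetElectionTimeout() is a hypothetical helper, not a real course API):

```go
// AppendEntries is the follower-side handler; heartbeats arrive here too.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	reply.Term = rf.currentTerm
	if args.Term < rf.currentTerm {
		reply.Success = false // stale leader: reject and do NOT reset the timeout
		return
	}
	rf.stepDownIfStale(args.Term)
	rf.resetElectionTimeout() // valid leader: always reset, even if the log check fails

	// Reject if our log is too short, or the entry at PrevLogIndex has
	// a different term than the leader expects.
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}

	// Drop the conflicting suffix and append the leader's entries.
	// NB: production code should truncate only on a real conflict, or a
	// stale RPC arriving late could erase newer entries.
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)

	// (The leaderCommit handling from the next section goes here.)
	reply.Success = true
}
```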
Propagating the updated commit index to followers.
When a leader adds a new entry, propagates it to the followers, receives success replies, and updates its commit index, it must propagate that new commit index to the followers with every heartbeat.
This is done using the leaderCommit field in the heartbeats.
Updating the Commit Index
Leader
This is equally important for the leader. The logic to update the commit index on the leader is as follows:
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm, then set commitIndex = N.
I had coded this logic wrong, which led to my leader not advancing commitIndex and waiting for two further commits to realise that values had been committed. This led to the first entry being missed while the commitIndex was propagated and advanced for all the following entries.
Let's consider a Raft cluster with 5 nodes.
The current state is:
currentTerm: 3
commitIndex: 5
Leader's log (A): [1(t1), 2(t1), 3(t2), 4(t2), 5(t2), 6(t3), 7(t3), 8(t3)]
matchIndex for followers: [5(B), 7(C), 6(D), 4(E)]
The leader checks for the highest N that satisfies all of the conditions:
N must be greater than commitIndex (5)
N must be replicated on a majority - at least 3 out of the 5 nodes
The log entry at index N must be from the current term (3)
One thing we have to consider is that the leader itself already has the entry at this index, so it counts toward the majority.
The leader examines potential N values:
N = 6:
6 > 5 (commitIndex) - satisfied
3 nodes have the entry: A itself, plus C and D with matchIndex ≥ 6 - satisfied
log[6].term == 3 (currentTerm) - satisfied
N = 7:
7 > 5 (commitIndex) - satisfied
Only 2 nodes have the entry: A itself and C with matchIndex ≥ 7 - not a majority
The leader determines that N = 6 is the highest value satisfying all conditions.
The leader updates its commitIndex to 6.
This logic must be checked on the leader every time a reply is received from a peer. Other logic, such as counting the frequency of success replies for each log entry, does not work.
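One straightforward way to implement that rule with the hypothetical struct from earlier, scanning down from the end of the log:

```go
// updateCommitIndex advances the leader's commitIndex to the highest N
// that is replicated on a majority and belongs to the current term.
// Run it after every successful AppendEntries reply; caller holds rf.mu.
func (rf *Raft) updateCommitIndex() {
	for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
		if rf.log[n].Term != rf.currentTerm {
			break // entries from older terms are never counted directly
		}
		count := 1 // the leader itself already holds index n
		for peer, match := range rf.matchIndex {
			if peer != rf.me && match >= n {
				count++
			}
		}
		if count*2 > len(rf.matchIndex) { // strict majority of the cluster
			rf.commitIndex = n
			return
		}
	}
}
```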
Follower
When a follower receives a higher commit index in a heartbeat, it must make the call whether to update its own commit index. The logic for it is as follows:
If leaderCommit > commitIndex then set commitIndex = min(leaderCommit, index of last new entry)
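In the AppendEntries handler sketched earlier, this is the piece that was left as a placeholder (min is Go 1.21's built-in):

```go
// Inside the AppendEntries handler, after appending the new entries:
if args.LeaderCommit > rf.commitIndex {
	// Never move past our own last entry, which may still lag the leader.
	rf.commitIndex = min(args.LeaderCommit, len(rf.log)-1)
}
```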
Logic of Handling Election
When a peer receives a RequestVote RPC, it has to evaluate a number of conditions to determine whether it should grant its vote or not.
Peer has a more up-to-date log
There are a couple of situations here that can make the peer reject the candidate; both are covered by the sketch after these examples.
- The peer has the same last log term as the candidate but a higher last log index
For example:
Candidate's log: [1(t1), 2(t2), 3(t3), 4(t3)]
Peer's log: [1(t1), 2(t2), 3(t3), 4(t3), 5(t3)]
The candidate and the peer have the same last log term (3), but the peer has a higher last log index
- The peer has a higher last log term than the candidate's last log term
For example:
Candidate's log: [1(t1), 2(t2), 3(t3), 4(t3)]
Peer's log: [1(t1), 2(t2), 3(t3), 4(t4)]
The peer has a higher term in its last log entry even though the last index is the same
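Both rejection cases collapse into the paper's single "at least as up-to-date" comparison. A sketch, with hypothetical names for the RequestVote arguments:

```go
// RequestVoteArgs mirrors the RequestVote arguments from the paper.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as ours: a higher last term wins outright; equal last terms
// fall back to comparing log length. Caller must hold rf.mu.
func (rf *Raft) candidateUpToDate(args *RequestVoteArgs) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if args.LastLogTerm != myLastTerm {
		return args.LastLogTerm > myLastTerm
	}
	return args.LastLogIndex >= myLastIndex
}
```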
Peer has a higher term
If a peer has a higher term than the candidate requesting the vote, the vote is rejected without any other check.
The candidate must demote itself to a follower immediately and wait until an election timeout fires on itself or some other node.
Starting Index of Logs
Raft starts its log index at 1, and the code becomes much easier if this is followed in the implementation. Unlike arrays, which start at index 0, Raft logs start at index 1, so inserting a dummy (nil) entry at index 0 forces the real entries to start at log index 1 and makes the code so much more convenient.
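Concretely, with the hypothetical struct from earlier, the constructor just seeds the log:

```go
// newRaft seeds the log with a dummy entry at index 0 so that the first
// real entry lands at index 1, matching the paper's numbering: len(rf.log)-1
// is then always the last log index, with no special-casing anywhere.
func newRaft(me, nPeers int) *Raft {
	return &Raft{
		me:         me,
		votedFor:   -1,
		log:        []LogEntry{{Term: 0}}, // dummy entry; Command stays nil
		nextIndex:  make([]int, nPeers),
		matchIndex: make([]int, nPeers),
	}
}
```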
I did not do this, and it caused a lot of code changes later; since everything depends on log indices, it was a bloodbath to get everything working again after this huge change.
Conclusion
Compared to other projects, the Raft algorithm is definitely one of the most challenging. However, it is also one of the coolest projects I have done. This free set of tips is something I wish I had when I started, to keep me from writing the wrong solution.
The Raft algorithm finds its value in many places, from etcd in Kubernetes to Kafka, and is definitely a worthwhile algorithm to know, one that can be adapted and personalized for large-scale distributed systems.
If you found these quick tips and instructions useful, then please consider subscribing and clapping for the post.