Raft vs Multi-Paxos: Architecture and Implementation Differences
Both Raft and Multi-Paxos implement multi-decree consensus to maintain a replicated, ordered log across unreliable nodes, but they differ significantly in structure and operational clarity. Multi-Paxos is an optimization of basic Paxos in which a single stable leader handles many proposals after an initial election round, reducing steady-state writes to one quorum round trip instead of two. However, Multi-Paxos is underspecified in the original literature: implementers must make many decisions about log management, leader election, and membership changes, which has led to diverse implementations that are hard to compare.
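The steady-state saving is easiest to see in code. Below is a minimal sketch, not a full Paxos implementation: once a leader has established its ballot with a Prepare/Promise round, it reuses that ballot and sends only Accept messages for each new log slot. The types and the acceptor interface here are hypothetical, and failure handling is elided.

```go
package paxos

// Ballot identifies a leadership term; higher ballots win.
type Ballot struct {
	Round  int
	NodeID int
}

// Acceptor is a hypothetical interface to one replica's acceptor role.
type Acceptor interface {
	// Accept asks the acceptor to persist value v in log slot `slot` under
	// ballot b, returning true unless it has promised a higher ballot.
	Accept(b Ballot, slot int, v []byte) bool
}

// Leader is a stable Multi-Paxos leader whose ballot was already established
// by a Prepare/Promise round (not shown).
type Leader struct {
	ballot    Ballot
	nextSlot  int
	acceptors []Acceptor
}

// Propose replicates one command in steady state: no Prepare phase, just a
// single Accept round trip to the replica group.
func (l *Leader) Propose(cmd []byte) bool {
	slot := l.nextSlot
	l.nextSlot++

	acks := 0
	for _, a := range l.acceptors {
		if a.Accept(l.ballot, slot, cmd) {
			acks++
		}
	}
	// The value is chosen once a majority has accepted it under this ballot.
	return acks > len(l.acceptors)/2
}
```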
Raft explicitly decomposes the problem into three first-class concerns: leader election using randomized timeouts, log replication in which the leader manages entries indexed by term and log position, and membership changes via joint consensus. The leader maintains authority by sending periodic heartbeats, typically at one-tenth of the election timeout interval. Followers that do not receive a heartbeat within the election timeout (commonly 150 to 300 milliseconds in wide-area networks, or 50 to 150 milliseconds in local-area networks) start a new election by incrementing the term and requesting votes. This explicit structure makes Raft significantly easier to understand and implement correctly, which is why it has been widely adopted in systems like etcd, Consul, and CockroachDB.
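A minimal sketch of the follower-side timing rule described above, assuming a hypothetical node type fed by a heartbeat channel: the election timeout is drawn randomly from a range so that split votes are unlikely, and each heartbeat from the leader resets the timer.

```go
package raft

import (
	"math/rand"
	"time"
)

// Election timeout range from the prose above; heartbeats arrive roughly every
// timeout/10 while a leader is healthy.
const (
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout picks a fresh timeout in [min, max) so that two
// followers rarely time out at the same instant.
func randomElectionTimeout() time.Duration {
	return electionTimeoutMin +
		time.Duration(rand.Int63n(int64(electionTimeoutMax-electionTimeoutMin)))
}

// follower is a hypothetical node in the follower state.
type follower struct {
	currentTerm int
	heartbeatCh chan struct{} // signaled on each AppendEntries from the leader
}

// runFollower loops until no heartbeat arrives within the election timeout,
// then transitions toward candidacy.
func (f *follower) runFollower() {
	for {
		timer := time.NewTimer(randomElectionTimeout())
		select {
		case <-f.heartbeatCh:
			// Leader is alive: discard this timer and start a fresh timeout.
			timer.Stop()
		case <-timer.C:
			// No heartbeat in time: increment the term and request votes
			// from peers (the RequestVote RPC itself is not shown).
			f.currentTerm++
			return
		}
	}
}
```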
Both algorithms enforce that only log entries consistent with the current leader's term can be committed, and both require durable logging with fsync before acknowledging writes to ensure safety across power failures. The steady-state write path is nearly identical: the leader appends the entry locally, performs an fsync, sends append messages to followers, and commits once a majority have persisted the entry. With proper pipelining and batching, both can achieve similar throughput; production systems like etcd are typically kept to sustained rates in the low thousands of queries per second to avoid exhausting compaction and election headroom.
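A sketch of that shared write path, assuming a hypothetical follower interface and with error handling, retries, and state-machine application elided: one local append-plus-fsync followed by one quorum round trip per committed entry.

```go
package replication

import (
	"encoding/json"
	"os"
)

// Entry is one replicated log record, identified by term and index.
type Entry struct {
	Term  int
	Index int
	Data  []byte
}

// Follower is a hypothetical transport to one follower replica.
type Follower interface {
	// AppendAndSync returns true once the follower has durably persisted the entry.
	AppendAndSync(e Entry) bool
}

// Leader holds the local write-ahead log and the follower connections.
type Leader struct {
	wal       *os.File
	followers []Follower
}

// Replicate runs one steady-state round: a local append+fsync plus one quorum round trip.
func (l *Leader) Replicate(e Entry) bool {
	// 1. Append to the local log and fsync before acknowledging anything.
	payload, err := json.Marshal(e)
	if err != nil {
		return false
	}
	if _, err := l.wal.Write(payload); err != nil {
		return false
	}
	if err := l.wal.Sync(); err != nil { // durable across power failure
		return false
	}

	// 2. Send the entry to followers and count durable acknowledgements.
	acks := 1 // the leader's own persisted copy counts toward the majority
	for _, f := range l.followers {
		if f.AppendAndSync(e) {
			acks++
		}
	}

	// 3. Committed once a majority of the cluster has persisted the entry.
	return acks > (len(l.followers)+1)/2
}
```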
💡 Key Takeaways
• Multi-Paxos reduces steady-state writes to one quorum round trip after the initial leader election, but its underspecification leads to implementation divergence and operational complexity
• Raft explicitly structures leader election with randomized timeouts, log replication with term and index pairs, and membership changes as joint consensus, making it significantly easier to implement
• Leaders send heartbeats at approximately one-tenth of the election timeout to maintain authority; typical timeouts are 150 to 300 milliseconds for wide-area networks and 50 to 150 milliseconds for local-area networks
• Both algorithms require a durable fsync before acknowledging writes, and fsync latency on the leader (0.2 to 2 milliseconds on NVMe) is a major component of overall commit latency
• Practical sustained throughput for production systems like etcd is kept in the low thousands of queries per second to preserve headroom for background compaction, snapshot generation, and election stability
• Google Spanner runs thousands of Paxos groups with leaders placed near write traffic, demonstrating horizontal scaling by sharding across many independent consensus groups rather than trying to scale a single group (see the routing sketch after this list)
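To illustrate the sharding point in the last takeaway, here is a hypothetical routing layer that hashes each key onto one of many independent consensus groups, each with its own leader. The group interface and hash choice are assumptions for the sketch, not any particular system's API.

```go
package sharding

import "hash/fnv"

// ConsensusGroup is a hypothetical handle on one independent Raft/Paxos group.
type ConsensusGroup interface {
	// Propose replicates the write through this group's leader and returns
	// once a majority of the group's replicas has committed it.
	Propose(key string, value []byte) error
}

// ShardedStore spreads keys over many independent consensus groups, so the
// aggregate write throughput scales with the number of group leaders.
type ShardedStore struct {
	groups []ConsensusGroup
}

// Write routes the key to its shard's consensus group by hashing the key.
func (s *ShardedStore) Write(key string, value []byte) error {
	h := fnv.New32a()
	h.Write([]byte(key))
	group := s.groups[int(h.Sum32())%len(s.groups)]
	return group.Propose(key, value)
}
```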
📌 Examples
Google Spanner runs a Paxos group per shard, typically with 5 replicas spread across 3 or more regions. Regional configurations can commit in 5 to 15 milliseconds (intra-region round trip plus fsync), while multi-region commits take 50 to 200 milliseconds depending on distance, for example 70 to 100 milliseconds US coast to coast.
CockroachDB implements Raft per range with 3 or 5 replicas, achieving 2 to 10 millisecond intra-region quorum writes and 50 to 200 milliseconds cross-region. Throughput scales with the number of ranges and leaders, but individual hot ranges remain constrained by the single-leader bottleneck.
HashiCorp Consul and Vault use Raft with 3 to 5 nodes for their control plane, committing linearizable writes in tens of milliseconds within a region for configuration, secrets, and service discovery metadata.