Raft vs Multi-Paxos: Architecture and Implementation Differences
Both Raft and Multi-Paxos implement multi-decree consensus to maintain a replicated, ordered log across unreliable nodes, but they differ significantly in structure and operational clarity. Multi-Paxos is an optimization of basic Paxos in which a single stable leader handles many proposals after an initial election round, reducing steady-state writes to one quorum round trip instead of two. However, Multi-Paxos is underspecified in the original literature: implementers must make many decisions about log management, leader election, and membership changes, which has led to diverse implementations that are hard to compare.
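The steady-state saving is easiest to see in code. Below is a minimal sketch, not a full Paxos implementation: once a leader has established its ballot with a Prepare/Promise round, it reuses that ballot and sends only Accept messages for each new log slot. The types and the acceptor interface here are hypothetical, and failure handling is elided.

```go
package paxos

// Ballot identifies a leadership term; higher ballots win.
type Ballot struct {
	Round  int
	NodeID int
}

// Acceptor is a hypothetical interface to one replica's acceptor role.
type Acceptor interface {
	// Accept asks the acceptor to persist value v in log slot `slot` under
	// ballot b, returning true unless it has promised a higher ballot.
	Accept(b Ballot, slot int, v []byte) bool
}

// Leader is a stable Multi-Paxos leader whose ballot was already established
// by a Prepare/Promise round (not shown).
type Leader struct {
	ballot    Ballot
	nextSlot  int
	acceptors []Acceptor
}

// Propose replicates one command in steady state: no Prepare phase, just a
// single Accept round trip to the replica group.
func (l *Leader) Propose(cmd []byte) bool {
	slot := l.nextSlot
	l.nextSlot++

	acks := 0
	for _, a := range l.acceptors {
		if a.Accept(l.ballot, slot, cmd) {
			acks++
		}
	}
	// The value is chosen once a majority has accepted it under this ballot.
	return acks > len(l.acceptors)/2
}
```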
Raft explicitly decomposes the problem into three first-class concerns: leader election using randomized timeouts, log replication in which the leader manages entries indexed by term and log position, and membership changes via joint consensus. The leader maintains authority by sending periodic heartbeats, typically at one-tenth of the election timeout interval. Followers that do not receive a heartbeat within the election timeout (commonly 150 to 300 milliseconds in wide-area networks, or 50 to 150 milliseconds in local-area networks) start a new election by incrementing the term and requesting votes. This explicit structure makes Raft significantly easier to understand and implement correctly, which is why it has been widely adopted in systems like etcd, Consul, and CockroachDB.
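A minimal sketch of the follower-side timing rule described above, assuming a hypothetical node type fed by a heartbeat channel: the election timeout is drawn randomly from a range so that split votes are unlikely, and each heartbeat from the leader resets the timer.

```go
package raft

import (
	"math/rand"
	"time"
)

// Election timeout range from the prose above; heartbeats arrive roughly every
// timeout/10 while a leader is healthy.
const (
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout picks a fresh timeout in [min, max) so that two
// followers rarely time out at the same instant.
func randomElectionTimeout() time.Duration {
	return electionTimeoutMin +
		time.Duration(rand.Int63n(int64(electionTimeoutMax-electionTimeoutMin)))
}

// follower is a hypothetical node in the follower state.
type follower struct {
	currentTerm int
	heartbeatCh chan struct{} // signaled on each AppendEntries from the leader
}

// runFollower loops until no heartbeat arrives within the election timeout,
// then transitions toward candidacy.
func (f *follower) runFollower() {
	for {
		timer := time.NewTimer(randomElectionTimeout())
		select {
		case <-f.heartbeatCh:
			// Leader is alive: discard this timer and start a fresh timeout.
			timer.Stop()
		case <-timer.C:
			// No heartbeat in time: increment the term and request votes
			// from peers (the RequestVote RPC itself is not shown).
			f.currentTerm++
			return
		}
	}
}
```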
Both algorithms enforce that only log entries consistent with the current leader's term can be committed, and both require durable logging with fsync before acknowledging writes to ensure safety across power failures. The steady-state write path is nearly identical: the leader appends the entry locally, performs an fsync, sends append messages to followers, and commits once a majority have persisted the entry. With proper pipelining and batching, both can achieve similar throughput; production systems like etcd are typically kept to sustained rates in the low thousands of queries per second to avoid exhausting compaction and election headroom.
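A sketch of that shared write path, assuming a hypothetical follower interface and with error handling, retries, and state-machine application elided: one local append-plus-fsync followed by one quorum round trip per committed entry.

```go
package replication

import (
	"encoding/json"
	"os"
)

// Entry is one replicated log record, identified by term and index.
type Entry struct {
	Term  int
	Index int
	Data  []byte
}

// Follower is a hypothetical transport to one follower replica.
type Follower interface {
	// AppendAndSync returns true once the follower has durably persisted the entry.
	AppendAndSync(e Entry) bool
}

// Leader holds the local write-ahead log and the follower connections.
type Leader struct {
	wal       *os.File
	followers []Follower
}

// Replicate runs one steady-state round: a local append+fsync plus one quorum round trip.
func (l *Leader) Replicate(e Entry) bool {
	// 1. Append to the local log and fsync before acknowledging anything.
	payload, err := json.Marshal(e)
	if err != nil {
		return false
	}
	if _, err := l.wal.Write(payload); err != nil {
		return false
	}
	if err := l.wal.Sync(); err != nil { // durable across power failure
		return false
	}

	// 2. Send the entry to followers and count durable acknowledgements.
	acks := 1 // the leader's own persisted copy counts toward the majority
	for _, f := range l.followers {
		if f.AppendAndSync(e) {
			acks++
		}
	}

	// 3. Committed once a majority of the cluster has persisted the entry.
	return acks > (len(l.followers)+1)/2
}
```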
💡 Key Takeaways
• Multi-Paxos reduces steady-state writes to one quorum round trip after the initial leader election, but its underspecification leads to implementation divergence and operational complexity
• Raft explicitly structures leader election with randomized timeouts, log replication with term and index pairs, and membership changes as joint consensus, making it significantly easier to implement
• Leaders send heartbeats at approximately one-tenth of the election timeout to maintain authority; typical timeouts are 150 to 300 milliseconds for wide-area networks and 50 to 150 milliseconds for local-area networks
• Both algorithms require a durable fsync before acknowledging writes, and fsync latency on the leader (0.2 to 2 milliseconds on NVMe) is a major component of overall commit latency
• Practical sustained throughput for production systems like etcd is kept in the low thousands of queries per second to preserve headroom for background compaction, snapshot generation, and election stability
• Google Spanner runs thousands of Paxos groups with leaders placed near write traffic, demonstrating horizontal scaling by sharding across many independent consensus groups rather than trying to scale a single group (see the routing sketch after this list)
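To illustrate the sharding point in the last takeaway, here is a hypothetical routing layer that hashes each key onto one of many independent consensus groups, each with its own leader. The group interface and hash choice are assumptions for the sketch, not any particular system's API.

```go
package sharding

import "hash/fnv"

// ConsensusGroup is a hypothetical handle on one independent Raft/Paxos group.
type ConsensusGroup interface {
	// Propose replicates the write through this group's leader and returns
	// once a majority of the group's replicas has committed it.
	Propose(key string, value []byte) error
}

// ShardedStore spreads keys over many independent consensus groups, so the
// aggregate write throughput scales with the number of group leaders.
type ShardedStore struct {
	groups []ConsensusGroup
}

// Write routes the key to its shard's consensus group by hashing the key.
func (s *ShardedStore) Write(key string, value []byte) error {
	h := fnv.New32a()
	h.Write([]byte(key))
	group := s.groups[int(h.Sum32())%len(s.groups)]
	return group.Propose(key, value)
}
```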
📌 Examples
Google Spanner runs a Paxos group per shard, typically with 5 replicas spread across 3 or more regions. Regional configurations can commit in 5 to 15 milliseconds (intra-region round trip plus fsync), while multi-region commits take 50 to 200 milliseconds depending on distance, for example 70 to 100 milliseconds US coast to coast.
CockroachDB implements Raft per range with 3 or 5 replicas, achieving 2 to 10 millisecond intra-region quorum writes and 50 to 200 milliseconds cross-region. Throughput scales with the number of ranges and leaders, but individual hot ranges remain constrained by the single-leader bottleneck.
HashiCorp Consul and Vault use Raft with 3 to 5 nodes for their control plane, committing linearizable writes in tens of milliseconds within a region for configuration, secrets, and service discovery metadata.