Distributed Systems Primitives • Consensus Algorithms (Raft, Paxos)
When to Use Consensus vs Alternative Replication Strategies
Consensus algorithms are the correct choice when you require linearizable writes and reads with strong safety guarantees: financial ledgers where duplicate or conflicting transactions are unacceptable, control planes managing cluster membership and leader election, and lock services coordinating access to shared resources. Google Chubby exemplifies this pattern, using Multi-Paxos to provide distributed locks for Bigtable, GFS, and Borg, accepting tens of milliseconds of latency and seconds of failover time to guarantee that no two clients ever believe they hold the same lock. The critical insight is that consensus gives up availability during partitions and accepts higher write latency in exchange for absolute correctness, making it essential when safety cannot be compromised.
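The quorum rule behind that trade-off is easy to see in code. The sketch below is plain Go, not a real Raft or Paxos implementation (no terms, log matching, or elections); the replica functions and timeout are illustrative. It acknowledges a write only once a majority of replicas have it, and refuses to commit when a partition leaves no majority reachable.

```go
// Minimal sketch of the majority-quorum rule at the heart of consensus:
// a write is durable only once a majority of replicas acknowledge it.
package main

import (
	"errors"
	"fmt"
	"time"
)

// replicateToQuorum sends an entry to every replica and waits until a
// majority acknowledges. If a partition leaves fewer than a majority
// reachable, the write eventually fails: safety over availability.
func replicateToQuorum(entry string, replicas []func(string) bool, timeout time.Duration) error {
	acks := make(chan bool, len(replicas))
	for _, replicate := range replicas {
		go func(r func(string) bool) { acks <- r(entry) }(replicate)
	}

	needed := len(replicas)/2 + 1 // majority of the replica set (a real leader would also count its own log entry)
	got := 0
	deadline := time.After(timeout)
	for range replicas {
		select {
		case ok := <-acks:
			if ok {
				got++
			}
			if got >= needed {
				return nil // committed: a majority holds the entry
			}
		case <-deadline:
			return errors.New("no quorum before timeout; write not committed")
		}
	}
	return errors.New("quorum unreachable; write not committed")
}

func main() {
	up := func(e string) bool { return true }                             // healthy replica
	down := func(e string) bool { time.Sleep(time.Second); return false } // partitioned replica

	// 3 of 5 replicas reachable: the write commits despite two failures.
	err := replicateToQuorum("debit account 42 by $10",
		[]func(string) bool{up, up, up, down, down}, 500*time.Millisecond)
	fmt.Println("commit error:", err)
}
```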
However, many applications can achieve better availability and latency by relaxing consistency requirements. Read-heavy workloads with infrequent updates and tolerance for bounded staleness should use asynchronous replication with a primary-backup pattern, where a primary accepts writes and asynchronously propagates them to read replicas. This lets read replicas serve traffic during write-path failures and reduces read latency by placing replicas near users, at the cost of serving data that may be seconds to minutes behind. For workloads that must keep accepting writes during network partitions, such as multi-region collaborative editing or shopping carts, Conflict-free Replicated Data Types (CRDTs) or application-level conflict resolution with last-write-wins semantics provide better availability than consensus, though they require careful design to handle conflicting concurrent updates.
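To make the CRDT idea concrete, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs; the replica IDs are illustrative. Each replica increments only its own slot, and merging takes the per-replica maximum, so concurrent updates made on both sides of a partition converge deterministically without coordination.

```go
// Minimal G-Counter sketch: a CRDT whose merge is commutative, associative,
// and idempotent, so replicas converge regardless of delivery order.
package main

import "fmt"

// GCounter tracks one monotonically increasing count per replica.
type GCounter map[string]int

// Inc records a local increment on the given replica.
func (c GCounter) Inc(replica string) { c[replica]++ }

// Value is the sum of all replicas' counts.
func (c GCounter) Value() int {
	total := 0
	for _, n := range c {
		total += n
	}
	return total
}

// Merge combines a divergent copy by taking the per-replica maximum.
func (c GCounter) Merge(other GCounter) {
	for id, n := range other {
		if n > c[id] {
			c[id] = n
		}
	}
}

func main() {
	// Two replicas accept writes independently during a partition...
	us, eu := GCounter{}, GCounter{}
	us.Inc("us-east")
	us.Inc("us-east")
	eu.Inc("eu-west")

	// ...and reconcile when connectivity returns.
	us.Merge(eu)
	eu.Merge(us)
	fmt.Println(us.Value(), eu.Value()) // both print 3
}
```

Shopping carts and collaborative documents typically use richer structures (observed-remove sets, sequence CRDTs), but the discipline is the same: every operation must be mergeable deterministically after the fact.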
Geographic requirements also heavily influence the decision. For applications requiring sub-10-millisecond global writes, consensus is physically impossible due to speed-of-light constraints; even a US coast-to-coast round trip takes roughly 70 to 100 milliseconds on real networks before any consensus protocol overhead. In these cases, the available techniques are asynchronous multi-primary replication with conflict detection, session affinity that routes users to nearby regions with eventual cross-region propagation, or carefully designed CRDTs that merge concurrent updates deterministically. Google Spanner accepts 50 to 200 milliseconds of cross-region write latency to provide strong consistency, which is appropriate for financial and inventory systems where correctness outweighs latency, but would be unsuitable for user-facing features requiring immediate feedback.
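A back-of-the-envelope calculation shows why physics alone rules out sub-10-millisecond global consensus. The distances and the roughly two-thirds-of-c propagation speed in fiber used below are approximations, and real routes, queuing, and multiple protocol round trips sit well above this floor.

```go
// Rough propagation-delay floor for a single round trip over fiber.
// Figures are approximations for illustration, not measurements.
package main

import "fmt"

func main() {
	const fiberKmPerMs = 200.0 // light in fiber covers roughly 200,000 km/s, i.e. ~200 km per ms

	routes := map[string]float64{
		"US coast to coast (~4,500 km path)": 4500,
		"US to Europe (~7,000 km path)":      7000,
		"US to Asia (~10,000 km path)":       10000,
	}
	for name, km := range routes {
		rtt := 2 * km / fiberKmPerMs
		// A consensus commit needs at least one such round trip to a quorum
		// (client -> leader -> followers -> leader -> client adds more), and
		// real networks add routing and processing on top of this floor.
		fmt.Printf("%-38s propagation floor ~%.0f ms RTT\n", name, rtt)
	}
}
```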
💡 Key Takeaways
• Use consensus for linearizable writes and reads where safety is non-negotiable: financial ledgers, control planes, lock services, and distributed coordination requiring strict ordering
• Consensus halts writes during network partitions when no majority exists, making it unsuitable for applications requiring write availability in every partition or sub-10-millisecond global writes
• Asynchronous replication with primary-backup provides better read availability and latency for read-heavy workloads with infrequent updates, accepting eventual consistency measured in seconds to minutes
• Conflict-free Replicated Data Types (CRDTs) or last-write-wins strategies enable writes during partitions for use cases like collaborative editing and shopping carts, trading strong consistency for availability
• Geographic constraints dictate feasibility: single-region consensus commits in 5 to 15 milliseconds but risks regional failure, while global consensus requires 50 to 200 milliseconds, making it unsuitable for latency-sensitive user interactions
• Single-writer primary with synchronous standbys offers a simpler operational model than full consensus while still providing quorum-like durability guarantees, appropriate when failover complexity can be externalized (see the sketch after this list)
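A minimal sketch of that last pattern, assuming an in-memory standby and illustrative names: the primary acknowledges a write only after its synchronous standby confirms it, so committed data survives the loss of either node, but there is no election logic; promoting the standby after a failure is left to an operator or external orchestrator.

```go
// Single-writer primary with one synchronous standby: durability without
// consensus, with failover handled outside the data path.
package main

import (
	"errors"
	"fmt"
	"time"
)

// Standby is anything that can durably accept a replicated record.
type Standby interface {
	Append(record string) error
}

type Primary struct {
	log     []string
	standby Standby
}

// Write appends locally, then waits for the synchronous standby before
// acknowledging. If the standby is unreachable, the write is not
// acknowledged; unlike quorum-based consensus, nothing here tolerates the
// failure automatically.
func (p *Primary) Write(record string) error {
	p.log = append(p.log, record)
	if err := p.standby.Append(record); err != nil {
		return fmt.Errorf("standby did not confirm, write not acknowledged: %w", err)
	}
	return nil
}

// memoryStandby simulates a healthy or partitioned standby.
type memoryStandby struct {
	log  []string
	down bool
}

func (s *memoryStandby) Append(record string) error {
	if s.down {
		time.Sleep(10 * time.Millisecond) // pretend we waited on the network
		return errors.New("standby unreachable")
	}
	s.log = append(s.log, record)
	return nil
}

func main() {
	standby := &memoryStandby{}
	primary := &Primary{standby: standby}
	fmt.Println(primary.Write("order 1001 shipped")) // <nil>: durable on both nodes

	standby.down = true
	fmt.Println(primary.Write("order 1002 shipped")) // error: no synchronous copy
}
```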
📌 Examples
Google Spanner uses Paxos to provide a transactional SQL database with strict consistency across regions, accepting 50 to 200 milliseconds of write latency. In contrast, Google's frontend serving systems use asynchronous replication and caching with bounded staleness to achieve single-digit-millisecond response times globally, accepting that users might see slightly stale data.
Kubernetes uses etcd with Raft for its control-plane metadata because correctness is paramount: scheduling the same pod to two nodes or missing a pod deletion would cause severe issues. However, application logs and metrics use eventually consistent systems because temporary inconsistency is acceptable and ultra-high availability is more valuable.
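etcd exposes this consistency choice directly to clients. The sketch below uses the official Go client (go.etcd.io/etcd/client/v3) and assumes an etcd endpoint on localhost:2379, with the key names chosen for illustration: the default Get is linearizable (the Raft leader confirms it still holds a quorum before answering), while WithSerializable() is answered from the contacted member's local state, which is faster and more available but possibly stale.

```go
// Linearizable vs. serializable reads against a Raft-backed etcd cluster.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Writes always go through Raft: committed only after a quorum persists them.
	if _, err := cli.Put(ctx, "/pods/web-1/node", "node-7"); err != nil {
		panic(err)
	}

	// Default read is linearizable: the leader checks it still has a quorum
	// before answering, so a scheduler never acts on a stale assignment.
	strong, err := cli.Get(ctx, "/pods/web-1/node")
	if err != nil {
		panic(err)
	}

	// Serializable read is served from the contacted member's local state:
	// lower latency and available without the leader, but possibly stale;
	// fine for dashboards, not for scheduling decisions.
	relaxed, err := cli.Get(ctx, "/pods/web-1/node", clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}

	fmt.Println("linearizable:", string(strong.Kvs[0].Value))
	fmt.Println("serializable:", string(relaxed.Kvs[0].Value))
}
```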
Amazon DynamoDB offers both strongly consistent reads via quorum and eventually consistent reads with lower latency and higher availability. The eventually consistent mode allows reads to succeed from any replica even during partitions, serving data that may be a few hundred milliseconds to seconds stale, which is acceptable for product catalogs and recommendations but not for inventory counts during checkout.
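With the AWS SDK for Go v2, that choice is a single flag on the read request. In the sketch below, the table name (ProductCatalog) and key schema are hypothetical; the point is the ConsistentRead parameter.

```go
// Per-request consistency choice in DynamoDB via the AWS SDK for Go v2.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	key := map[string]types.AttributeValue{
		"Id": &types.AttributeValueMemberS{Value: "sku-12345"},
	}

	// Eventually consistent read (the default): may be served by any replica,
	// cheaper and lower latency, possibly slightly stale. Fine for a catalog page.
	catalogView, err := client.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String("ProductCatalog"),
		Key:       key,
	})

	// Strongly consistent read: reflects all previously acknowledged writes.
	// Use this for the inventory check at checkout.
	checkoutView, err2 := client.GetItem(ctx, &dynamodb.GetItemInput{
		TableName:      aws.String("ProductCatalog"),
		Key:            key,
		ConsistentRead: aws.Bool(true),
	})

	fmt.Println(catalogView, err, checkoutView, err2)
}
```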