
Implementation Patterns: Quorums, Gossip, and Multi-Region Topology

Quorum-based replication is the foundational pattern for both CP and tunable systems. Choose a replication factor N (commonly 3 or 5) and read/write quorums R and W such that R + W > N, which guarantees that every read set overlaps the latest write set (strong consistency). A common single-region setup, N=3, R=2, W=2, survives the loss of one node and provides linearizability if quorum reads are enforced after writes. The latency cost is waiting for the slower of 2 replicas; techniques like hedged requests (send a duplicate request after the p50 latency has elapsed) or speculative retries cut p99 tails by 30 to 50 percent.

CP systems use consensus protocols (Paxos or Raft) for log replication and leader election. Each write appends to a replicated log after majority acknowledgment, adding one to two round trips plus durable logging (fsync or SSD write, typically 1 to 5ms). Leaders hold time-bounded leases (renewed via heartbeat) to prevent split-brain; losing the majority means losing the ability to renew the lease, forcing unavailability. Google Spanner places replicas across zones and regions, with each Paxos group electing its leader near the majority of its users to minimize write latency. Cross-region writes require at least one intercontinental round trip (100 to 200ms) plus commit wait, making them unsuitable for user-facing writes unless confined to regional scopes.

AP systems use gossip (epidemic broadcast) and anti-entropy (Merkle tree comparison) to detect and repair divergence. Cassandra nodes gossip cluster state every second; hinted handoff queues writes destined for temporarily unavailable replicas and delivers them when the node returns. Read repair reconciles on read (when a read at consistency level QUORUM detects mismatched replicas), and scheduled repair runs Merkle tree comparisons to find and fix silent divergence. The cost is background bandwidth and CPU, plus a window of inconsistency (seconds to minutes depending on load and the repair schedule).
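To make the R + W > N overlap concrete, here is a minimal quorum-coordinator sketch. The in-memory Replica class, the timestamp-as-version scheme, and the read-from-the-first-R shortcut are illustrative assumptions, not any particular database's API; real systems use coordinated timestamps or version vectors and pick any R reachable replicas.

```python
import time
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    version: int  # write version; the newest version wins on read


class Replica:
    """In-memory stand-in for one storage node (illustrative only)."""

    def __init__(self) -> None:
        self.data: dict[str, Versioned] = {}

    def write(self, key: str, record: Versioned) -> bool:
        current = self.data.get(key)
        if current is None or record.version > current.version:
            self.data[key] = record
        return True  # acknowledge the write

    def read(self, key: str) -> Versioned | None:
        return self.data.get(key)


class QuorumCoordinator:
    """Writes go to all N replicas and commit after W acks; reads query R
    replicas and return the highest version seen. R + W > N guarantees the
    read set overlaps the latest committed write set."""

    def __init__(self, replicas: list[Replica], r: int, w: int) -> None:
        if r + w <= len(replicas):
            raise ValueError(f"need R + W > N, got R={r}, W={w}, N={len(replicas)}")
        self.replicas, self.r, self.w = replicas, r, w

    def put(self, key: str, value: str) -> bool:
        record = Versioned(value, version=time.monotonic_ns())
        acks = sum(replica.write(key, record) for replica in self.replicas)
        return acks >= self.w  # commit once a write quorum has acknowledged

    def get(self, key: str) -> str | None:
        responses = [replica.read(key) for replica in self.replicas[: self.r]]
        newest = max((v for v in responses if v), key=lambda v: v.version, default=None)
        return newest.value if newest else None


# N=3, R=2, W=2: the read quorum always intersects the last write quorum.
cluster = QuorumCoordinator([Replica(), Replica(), Replica()], r=2, w=2)
cluster.put("user:42", "tier=gold")
print(cluster.get("user:42"))  # -> tier=gold
```

With N=3 and W=2 the coordinator can commit while one replica is down or slow, which is exactly the fault-tolerance and tail-latency tradeoff described above.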
💡 Key Takeaways
Quorum with R + W > N guarantees strong consistency via overlap; a typical N=3, R=2, W=2 setup survives one failure with 1 to 5ms intra-region latency, but tail latency is bounded by the slowest replica in the quorum
Hedged requests and speculative retries (send a duplicate after the p50 latency) reduce the p99 tail by 30 to 50 percent in quorum systems by racing replicas, at the cost of 1.5x to 2x backend load (see the sketch after this list)
Consensus (Paxos/Raft) for CP adds one to two round trips per write plus a durable log append; Google Spanner uses Paxos groups with leaders placed near user traffic to minimize cross-region latency
Multi-region CP requires at least one intercontinental round trip (100 to 200ms); confine strongly consistent writes to a regional scope and use async replication for reads to avoid user-facing latency hits
AP gossip and anti-entropy repair divergence over seconds to minutes; Cassandra hinted handoff queues writes for downed nodes, but silent data loss can occur if hints expire (default 3 hours) before the node returns
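A minimal version of the hedging pattern from the second takeaway, assuming a thread pool per client and a fetch_from_replica stub with an artificial latency straggler; the 5ms p50 threshold is an assumed value, not a recommendation:

```python
import concurrent.futures as futures
import random
import time

P50_LATENCY_S = 0.005  # hedge threshold: the observed median latency (assumed 5ms)
pool = futures.ThreadPoolExecutor(max_workers=4)


def fetch_from_replica(replica_id: int, key: str) -> str:
    """Stand-in for a replica read with an occasional latency straggler."""
    time.sleep(random.choice([0.002, 0.003, 0.020]))
    return f"{key}@replica{replica_id}"


def hedged_get(key: str, replicas: list[int]) -> str:
    """Send the read to one replica; if it has not answered within the p50
    latency, race a duplicate against a second replica and return whichever
    finishes first. The loser keeps running, which is the extra backend load."""
    primary = pool.submit(fetch_from_replica, replicas[0], key)
    try:
        return primary.result(timeout=P50_LATENCY_S)  # fast path: no hedge sent
    except futures.TimeoutError:
        hedge = pool.submit(fetch_from_replica, replicas[1], key)
        done, _ = futures.wait({primary, hedge}, return_when=futures.FIRST_COMPLETED)
        return next(iter(done)).result()


print(hedged_get("user:42", replicas=[0, 1]))
pool.shutdown()
```

Because the hedge is only sent after the p50 threshold, roughly half of requests never trigger it, which is why the extra backend load stays near 1.5x rather than doubling.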
📌 Examples
Google Spanner for AdWords: place the Paxos leader in US East where most bidding traffic originates; reads within US East are 1 to 5ms, writes are 5 to 10ms including commit wait; cross-region reads use the nearest replica with bounded staleness
Cassandra at Netflix: N=3, consistency level ONE for writes (single-digit ms, always available), scheduled repair every 10 days with Merkle tree comparison (sketched after these examples) to detect silent divergence in petabyte-scale data
Amazon DynamoDB: replicates across 3 availability zones with intra-region latency under 5ms; Global Tables use asynchronous cross-region replication taking hundreds of milliseconds to seconds depending on data size and distance
Hedged requests in Google services: after the p50 latency (say 5ms), send a duplicate request to another replica; the first response wins, reducing p99 from 20ms to 8ms while increasing backend queries per second by roughly 50 percent
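The Merkle-style comparison behind Cassandra's scheduled repair (second example) can be sketched as below. The fixed four leaf ranges and SHA-256 bucketing are simplifications for illustration, not Cassandra's actual token-range trees:

```python
import hashlib

NUM_RANGES = 4  # fixed leaf count for the sketch; real trees are much deeper


def bucket_of(key: str) -> int:
    """Stable key-to-range assignment (stand-in for token ranges)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_RANGES


def leaf_hashes(replica: dict[str, str]) -> list[str]:
    """Hash every key range; a full Merkle tree also hashes parents so the
    comparison can stop at the root when nothing has diverged."""
    buckets: list[list[str]] = [[] for _ in range(NUM_RANGES)]
    for key in sorted(replica):
        buckets[bucket_of(key)].append(f"{key}={replica[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]


def divergent_ranges(a: dict[str, str], b: dict[str, str]) -> list[int]:
    """Only ranges whose hashes differ need their rows streamed and repaired."""
    return [i for i, (ha, hb) in enumerate(zip(leaf_hashes(a), leaf_hashes(b)))
            if ha != hb]


a = {"user:1": "gold", "user:2": "silver", "user:3": "bronze"}
b = {"user:1": "gold", "user:2": "platinum", "user:3": "bronze"}  # silent divergence
print(divergent_ranges(a, b))  # indices of the ranges that must be repaired
```

Exchanging a handful of range hashes instead of every row is what keeps scheduled repair affordable at petabyte scale: only the mismatched ranges are re-streamed.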