Distributed Systems Primitives › Consensus Algorithms (Raft, Paxos) · Medium · ⏱️ ~3 min

Consensus Performance: Latency, Throughput, and Geographic Trade-offs

Consensus system performance is fundamentally bounded by three factors: quorum round trip time to the slowest member of the majority, durable storage fsync latency on the leader, and leader CPU and disk I/O capacity. In a single availability zone deployment with 3 replicas on NVMe storage and sub 1 millisecond network round trip times, typical p50 commit latency ranges from 2 to 6 milliseconds: approximately 0.2 to 2 milliseconds for the leader fsync, 0.5 to 1 millisecond for network transmission, and 0.2 to 2 milliseconds for the follower fsync, plus protocol overhead. Moving to 5 replicas across multiple zones with 1 to 2 milliseconds inter zone latency increases this to 5 to 15 milliseconds, because the commit must wait for a third acknowledgment that experiences higher round trip times.

Geographic placement creates stark trade-offs between latency, availability, and cost. Single region deployments provide the lowest latency but are vulnerable to regional failures that halt all writes when the majority is lost. Multi zone configurations within a region add only 1 to 2 milliseconds of round trip time while protecting against zone failures, an excellent balance for most applications. Cross region deployments sacrifice write latency (US coast to coast adds 70 to 100 milliseconds, transoceanic links add 150 to 250 milliseconds) but provide resilience against entire region outages. Google Megastore accepts 100 millisecond plus synchronous cross datacenter commits by running Paxos per small entity group to limit contention and by optimizing for read heavy workloads.

Throughput scaling requires careful attention to leader bottlenecks and batching strategies. A single Raft or Paxos group is limited by leader CPU for marshaling and validation, disk I/O for fsync operations, and network bandwidth to followers.
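The latency breakdown above can be turned into a back-of-envelope model. This is a sketch under simplifying assumptions: the function name and parameters are illustrative, the stages are composed serially, and real systems typically overlap the leader fsync with replication, so it errs pessimistic.

```python
def commit_latency_p50(leader_fsync_ms, follower_fsync_ms,
                       follower_rtts_ms, overhead_ms=0.5):
    """Rough p50 commit latency for a leader-based consensus group.

    Serial model: leader fsync, then replication, then the binding
    follower's fsync, plus fixed protocol overhead. A commit needs acks
    from floor(n/2) followers (a majority once the leader counts itself),
    so the binding round trip is the slowest RTT among the fastest
    floor(n/2) followers.
    """
    n = len(follower_rtts_ms) + 1              # followers plus the leader
    acks_needed = n // 2                       # majority minus the leader
    binding_rtt = sorted(follower_rtts_ms)[acks_needed - 1]
    return leader_fsync_ms + binding_rtt + follower_fsync_ms + overhead_ms

# 3 replicas in one zone: majority is the leader plus 1 of 2 followers.
print(commit_latency_p50(1.0, 1.0, [0.5, 0.6]))             # 3.0 ms
# 5 replicas across zones: the 2nd-fastest follower RTT now binds.
print(commit_latency_p50(1.0, 1.0, [1.2, 1.5, 1.8, 2.0]))   # 4.0 ms
```

The model makes the 5-replica penalty concrete: adding replicas does not change the formula, it changes which follower's round trip you wait on.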
Production systems like Google Spanner scale by sharding data across thousands of independent Paxos groups, each with its own leader, rather than trying to scale a single consensus group. Within a group, aggressive batching of small entries amortizes fsync cost but must be capped to prevent tail latency spikes; batches should typically be committed within 10 to 50 milliseconds to keep p99 latency acceptable.
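The capped batching window described above might be sketched as follows. `LogBatcher`, its parameter names, and the flush callback are hypothetical, not taken from any named system; the point is that a batch flushes on whichever comes first, a size cap or a time cap, so no entry waits longer than the window.

```python
import time

class LogBatcher:
    """Illustrative fsync batching with a capped wait window.

    Entries accumulate until either max_entries is reached or max_wait_ms
    has elapsed since the first pending entry arrived, bounding the extra
    latency batching can add to any single entry.
    """
    def __init__(self, flush_fn, max_entries=64, max_wait_ms=10):
        self.flush_fn = flush_fn        # stands in for one fsync plus one
        self.max_entries = max_entries  # replication round per batch
        self.max_wait_ms = max_wait_ms
        self.pending = []
        self.first_arrival = None

    def append(self, entry):
        if not self.pending:
            self.first_arrival = time.monotonic()
        self.pending.append(entry)
        if len(self.pending) >= self.max_entries:
            self.flush()

    def poll(self):
        # Called from the leader's event loop: flush once the window expires.
        if self.pending and (time.monotonic() - self.first_arrival) * 1000 >= self.max_wait_ms:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        self.flush_fn(batch)
```

Raising `max_wait_ms` improves throughput (more entries per fsync) at the direct cost of tail latency, which is why the text caps the window at tens of milliseconds.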
💡 Key Takeaways
Single availability zone deployments with 3 replicas and NVMe achieve 2 to 6 milliseconds p50 commit latency, while 5 replicas across zones see 5 to 15 milliseconds due to waiting for slower majority members
Leader fsync latency of 0.2 to 2 milliseconds on NVMe is a critical component of overall commit time, and can dominate in local area network deployments with sub millisecond round trip times
Cross region consensus adds roughly 70 to 250 milliseconds commit latency depending on distance: 70 to 100 milliseconds US coast to coast, 150 to 250 milliseconds transoceanic, making it unsuitable for latency sensitive applications
Horizontal throughput scaling requires sharding across multiple independent consensus groups with separate leaders; Google Spanner runs thousands of Paxos groups rather than scaling a single group
Cross region replication at 10 megabytes per second sustained generates approximately 864 gigabytes daily egress, costing tens to hundreds of dollars per day depending on cloud provider and region pairs
Batching small entries amortizes fsync overhead to increase throughput, but batch windows must be capped at 10 to 50 milliseconds to avoid p99 latency degradation that triggers client timeouts
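The egress figure in the takeaways is easy to check. The per-gigabyte prices below are illustrative assumptions, since actual inter-region rates vary by provider and region pair:

```python
SECONDS_PER_DAY = 86_400

def daily_egress_gb(mb_per_sec):
    """Sustained replication rate in MB/s -> decimal gigabytes per day."""
    return mb_per_sec * SECONDS_PER_DAY / 1000

def daily_cost_usd(mb_per_sec, usd_per_gb):
    """Daily egress cost at an assumed per-GB rate (illustrative only)."""
    return daily_egress_gb(mb_per_sec) * usd_per_gb

print(daily_egress_gb(10))        # 864.0 GB/day
# Assuming $0.02-$0.09/GB as a plausible inter-region range:
print(daily_cost_usd(10, 0.02))   # ~17 USD/day at the low end
print(daily_cost_usd(10, 0.09))   # ~78 USD/day at the high end
```

At those assumed rates the "tens to hundreds of dollars per day" claim checks out for a single sustained 10 megabytes per second replication stream.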
📌 Examples
Google Spanner places leaders near write traffic within a Paxos group. Regional configurations achieve 5 to 15 milliseconds commits, suitable for transactional workloads, while global configurations accept 50 to 200 milliseconds for stronger durability. TrueTime bounded uncertainty windows enable externally consistent transactions and linearizable reads without contacting followers.
Kubernetes etcd clusters are deliberately kept within a single region or low latency zone group. With 3 to 5 members, NVMe storage, and 1 to 2 milliseconds round trip times, they sustain low thousands of queries per second with 5 to 25 milliseconds p50 and under 100 milliseconds p99. Operators explicitly avoid cross region etcd because 50 millisecond plus round trip times destabilize leader elections and inflate commit times.
A real world failure scenario: An operator removes a node from a 3 node cluster to upgrade hardware, temporarily running on 2 nodes. A second node then fails due to unrelated disk issues, leaving only 1 of 3 nodes operational. Since this is not a majority, all writes halt immediately despite having one healthy node, demonstrating how quorum requirements trade availability for safety.
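The quorum arithmetic behind this failure scenario is a one-line check. `has_write_quorum` is an illustrative helper, not part of any real system's API:

```python
def has_write_quorum(alive, cluster_size):
    """Writes can commit only if a strict majority of the configured
    membership is reachable; a removed-but-not-replaced node still
    counts toward cluster_size until the membership change commits."""
    return alive > cluster_size // 2

print(has_write_quorum(2, 3))   # True: 2 of 3 still commits during the upgrade
print(has_write_quorum(1, 3))   # False: 1 of 3 halts all writes
```

Note the asymmetry: losing one node of three is routine, but losing a second (for any reason) halts writes even though a healthy node remains, exactly the safety-over-availability trade the scenario demonstrates.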