Failure Modes in Quorum Systems: Network Partitions, Conflicts, and Tail Latency Amplification
Network partitions and availability zone outages are the most critical failure modes in quorum systems because they can render operations unavailable even while individual replicas remain healthy. With n = 3 and w = 2, if two availability zones become unreachable (for example, due to network misconfiguration or correlated power failure), writes fail because the remaining replica cannot form a write quorum. Similarly, with r = 2, reads fail when only one replica is reachable. In production, this drives the design choice of n = 3 across availability zones rather than n = 3 within a single zone: you trade higher cross AZ network latency (a median of 1 to 2 milliseconds instead of sub millisecond) for tolerance of entire zone failures. Sloppy quorum is a mitigation technique used by Amazon Dynamo where writes temporarily go to fallback nodes outside the preferred replica set when primary replicas are unreachable, maintaining write availability at the cost of increased reconciliation complexity through hinted handoff.
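To make the arithmetic concrete, here is a minimal sketch of the strict quorum availability check; the replica names and the `reachable` set are illustrative assumptions, not any particular system's API:

```python
# Minimal sketch of strict quorum availability checks. The replica IDs and
# the `reachable` set are illustrative, not any real system's interface.
N, W, R = 3, 2, 2  # total replicas, write quorum size, read quorum size

def can_write(reachable: set) -> bool:
    # A strict quorum write needs acknowledgments from at least W replicas.
    return len(reachable) >= W

def can_read(reachable: set) -> bool:
    # A strict quorum read needs responses from at least R replicas.
    return len(reachable) >= R

# Two of three availability zones unreachable: one healthy replica remains.
reachable = {"replica-az1"}
assert not can_write(reachable)  # 1 < W = 2, so writes are unavailable
assert not can_read(reachable)   # 1 < R = 2, so reads are unavailable
```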
Conflicting concurrent writes are an inherent challenge in leaderless quorum systems. When two coordinators independently collect write quorums for different versions of the same key, conflicts arise that must be resolved. Vector clocks can detect these conflicts by tracking causality, but they require application level merge logic. In practice, vector clock size must be bounded (typically 10 to 20 entries) by pruning old entries, which discards causality information and can misclassify versions, missing real conflicts or surfacing false ones. Last write wins conflict resolution based on timestamps is simpler but dangerous: with typical NTP clock skew of tens of milliseconds, concurrent writes arriving within this window can be incorrectly ordered, potentially dropping updates. Amazon Dynamo handles shopping cart conflicts by merging (unioning) all concurrent versions, accepting the complexity of occasionally showing a customer an item they thought they removed if concurrent updates occurred.
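A minimal sketch of vector clock conflict detection under these assumptions, with clocks modeled as per-coordinator counter maps and hypothetical node names:

```python
# Minimal sketch of vector clock conflict detection. Clocks are maps from
# node ID to a counter; the coordinator names are hypothetical.
def dominates(a: dict, b: dict) -> bool:
    # a causally descends from b if every counter in b is matched or
    # exceeded in a.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither clock dominates the other: the writes are concurrent and
    # the application must merge them (e.g., union shopping cart items).
    return not dominates(a, b) and not dominates(b, a)

# Two coordinators each accepted a write for the same key.
v1 = {"coordinator-1": 2, "coordinator-2": 1}
v2 = {"coordinator-1": 1, "coordinator-2": 2}
assert concurrent(v1, v2)  # conflict: both versions are kept as siblings
```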
Tail latency amplification is a fundamental challenge when waiting on multiple replicas. Because quorum operations must wait for w acknowledgments on writes or r responses on reads, operation latency is determined by the slowest replica in the quorum, not the median. With cross availability zone p99 RTT of 5 to 10 milliseconds and occasional slowness from garbage collection pauses (100 to 500 milliseconds), disk stalls, or network retransmissions, quorum operations can see p99 latencies of tens to hundreds of milliseconds without mitigation. Hedged requests reduce this by sending a duplicate request after a delay (for example, once the p95 latency threshold has elapsed), at the cost of increased backend load. Dynamic replica selection (snitching) tracks per replica latency percentiles and routes around consistently slow replicas, but requires care: excluding too many replicas shrinks the pool of viable quorums and can cascade into wider failures.
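A minimal sketch of a hedged read using asyncio; the 20 millisecond hedge delay stands in for a measured p95 threshold, and `fetch` is a simulated replica read, not a real client API:

```python
# Minimal hedged-read sketch. The hedge delay is an assumed p95 stand-in;
# fetch() simulates a replica that is occasionally slow (e.g., a GC pause).
import asyncio
import random

async def fetch(replica: str) -> str:
    await asyncio.sleep(random.choice([0.002, 0.002, 0.200]))
    return f"value-from-{replica}"

async def hedged_read(primary: str, backup: str, hedge_delay: float = 0.020) -> str:
    first = asyncio.ensure_future(fetch(primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        return first.result()  # fast path: primary answered within the threshold
    # Primary looks slow: hedge to a second replica and take whichever
    # response arrives first, trading extra backend load for tail latency.
    second = asyncio.ensure_future(fetch(backup))
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_read("replica-az1", "replica-az2")))
```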
💡 Key Takeaways
•With n = 3 and w = 2, a two availability zone outage makes writes unavailable despite one healthy replica. This drives deployment across at least three independent failure domains (AZs) rather than within a single datacenter.
•Sloppy quorum allows writes to temporary fallback nodes when primary replicas are unreachable, maintaining availability but creating hinted handoff replay debt. Prolonged outages can accumulate gigabytes of hints, causing replay storms (saturating disks at hundreds of MB/s) on recovery.
•Clock skew of tens of milliseconds (typical with NTP) can cause last write wins to incorrectly order concurrent writes arriving within the skew window. This can silently drop updates in high write rate scenarios (thousands of writes per second to the same key).
•Quorum tail latency is determined by the slowest replica. With a normal cross AZ p99 RTT of 5 to 10 milliseconds but occasional 100 to 500 millisecond garbage collection pauses, unmitigated p99 can spike to hundreds of milliseconds.
•Read staleness occurs when r + w ≤ n or due to sloppy quorum delays. With n = 3, w = 2, r = 1, reads can miss writes because 2 + 1 = 3 (not greater than 3); see the sketch after this list. Read repair mitigates this by writing back to stale replicas but can spike latency when copying large values (megabytes).
•Membership changes during reconfiguration can violate quorum intersection if not coordinated properly. Removing replica A and adding replica D concurrently yields old (A,B,C) and new (D,B,C) configurations whose quorums need not overlap: a w = 2 write to A,B and an r = 2 read from D,C miss each other entirely.
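As referenced in the staleness and reconfiguration takeaways above, the overlap guarantee reduces to a pigeonhole check. A minimal sketch, with replica names matching the reconfiguration example:

```python
# Minimal sketch of the quorum intersection condition r + w > n. The
# replica names mirror the reconfiguration example in the takeaways.
def quorums_must_overlap(n: int, w: int, r: int) -> bool:
    # By pigeonhole, every write quorum of size w and read quorum of size r
    # drawn from n replicas share at least one replica only when r + w > n.
    return r + w > n

assert quorums_must_overlap(n=3, w=2, r=2)      # strict quorum: overlap guaranteed
assert not quorums_must_overlap(n=3, w=2, r=1)  # r + w = 3 = n: reads can miss writes

# Concrete miss: the write quorum and read quorum are disjoint.
write_quorum, read_quorum = {"A", "B"}, {"C"}
assert not (write_quorum & read_quorum)  # no overlap, so the read sees stale data
```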
📌 Examples
In September 2015, Amazon DynamoDB had an availability event when network connectivity loss between availability zones caused quorum operations to fail for tables configured with strongly consistent reads, demonstrating the partition sensitivity of strict quorum systems.
Cassandra hinted handoff can accumulate tens of gigabytes of hints during prolonged node outages (hours to days). On recovery, hint replay can saturate the returning node's disk bandwidth (hundreds of MB/s), causing 10x to 100x latency increases for foreground operations until replay completes.
Vector clock size in production Riak clusters is typically limited to 10 to 20 entries. When exceeded, the system prunes the oldest entries, which can cause sibling versions to appear distinct when they actually have causal relationships, requiring unnecessary application merges.
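A minimal sketch of how pruning fabricates siblings; the 2 entry bound, the prune-by-lowest-counter heuristic, and the node names are assumptions for illustration, not Riak's actual pruning policy:

```python
# Illustrative sketch of causality loss from vector clock pruning. The
# 2-entry bound and prune heuristic are assumptions, not Riak's policy.
def dominates(a: dict, b: dict) -> bool:
    return all(a.get(node, 0) >= count for node, count in b.items())

def prune(clock: dict, max_entries: int) -> dict:
    # Keep only the max_entries highest counters (a stand-in for dropping
    # the oldest entries; real systems prune by per-entry timestamps).
    kept = sorted(clock.items(), key=lambda kv: kv[1], reverse=True)[:max_entries]
    return dict(kept)

ancestor   = {"n1": 1, "n2": 1, "n3": 1}
descendant = {"n1": 2, "n2": 1, "n3": 1}   # strictly newer than ancestor
assert dominates(descendant, ancestor)     # true causal order is visible

pruned = prune(descendant, max_entries=2)  # an entry for n2 or n3 is lost
assert not dominates(pruned, ancestor)     # the dropped entry hides descent
assert not dominates(ancestor, pruned)     # versions now look concurrent, so
                                           # they surface as siblings to merge
```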
Cross region quorum is impractical due to 50 to 150 millisecond inter region RTTs between continents. Amazon DynamoDB Global Tables instead use per region quorums with asynchronous cross region replication, accepting inconsistency windows of seconds and last write wins conflict resolution.