
Consensus Failure Modes and Production Operational Challenges

Consensus systems exhibit several critical failure modes that can violate safety or liveness if not handled properly. The most catastrophic is log divergence caused by incorrect term and index handling or by non-durable writes, which can leave different nodes with conflicting committed values at the same log position. Power loss during writes is particularly dangerous: if an entry is acknowledged before it is durably on disk, or if the storage stack reorders or drops writes so that acknowledged entries are missing after restart, safety is violated. Production implementations must use write-ahead logs with checksums, enforce strict fsync discipline so that each entry is durably persisted before it is acknowledged, and validate log integrity on restart. Google systems rely on carefully engineered storage stacks that guarantee write ordering and durability even across power failures.

Election instability creates a different class of problems. Garbage collection pauses lasting multiple seconds, disk stalls from slow fsyncs, or network jitter can cause followers to time out and start elections even while the leader is functioning. This leads to election storms in which repeated split votes prevent any candidate from achieving a majority, halting all writes. Several techniques address this: randomized election timeouts reduce the probability of split votes; pre-vote extensions let a candidate check whether it could win before incrementing the term and disrupting the current leader; and leader leases prevent elections while the lease is valid. Modern systems also tune garbage collection so that pauses stay below roughly one tenth of the election timeout, and monitor fsync latency to detect degrading disks before they cause timeouts.

Reconfiguration is one of the most operationally dangerous procedures. Without proper overlap guarantees, removing a node and then adding a replacement can create two disjoint majorities operating independently, a split brain scenario that violates safety. For example, in a 5-node cluster, removing 2 nodes leaves 3; if 2 new nodes are added before they have fully replicated the log, nodes can hold different views of the membership, and under a partition the 3 old nodes and a group built around the 2 new ones can each count a majority against their own view of the configuration. Raft addresses this with joint consensus: during the transition, the system requires majorities from both the old and the new configuration, so no single group can make decisions alone. Common operational mistakes include removing multiple nodes simultaneously during maintenance windows or performing reconfigurations during network instability.
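To make the joint consensus rule concrete, here is a minimal sketch in Go (the `Config` type, function names, and the acknowledgment map are hypothetical, not taken from any particular Raft implementation) of a commit check that succeeds only when an entry is acknowledged by a majority of the old configuration and a majority of the new one:

```go
package raft

// Config is a hypothetical set of voting member IDs.
type Config map[string]bool

// hasMajority reports whether the nodes in acked cover a strict majority of cfg.
func hasMajority(cfg Config, acked map[string]bool) bool {
	votes := 0
	for id := range cfg {
		if acked[id] {
			votes++
		}
	}
	return votes*2 > len(cfg)
}

// canCommitJoint enforces the joint-consensus rule: while a membership
// change is in flight, an entry commits only if it has been replicated
// to a majority of the old configuration AND a majority of the new one,
// so neither group can decide anything on its own.
func canCommitJoint(oldCfg, newCfg Config, acked map[string]bool) bool {
	return hasMajority(oldCfg, acked) && hasMajority(newCfg, acked)
}
```

Once the entry that introduces the joint configuration has itself committed under this rule, the system can append and commit the final configuration and return to a single-majority check.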
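The pre-vote extension mentioned above can be sketched in the same spirit. Assuming hypothetical type and field names, a node grants a pre-vote only if it has not heard from a leader recently and the candidate's log is at least as up to date as its own; because no term is incremented, a partitioned node cannot disrupt a healthy leader:

```go
package raft

import "time"

// node holds the minimal state a follower needs to answer a pre-vote
// probe; the field names here are hypothetical.
type node struct {
	lastLeaderContact time.Time
	electionTimeout   time.Duration
	lastLogTerm       uint64
	lastLogIndex      uint64
}

// PreVoteRequest is the probe a would-be candidate sends before a real
// election. It carries the candidate's last log position but does not
// increment anyone's current term.
type PreVoteRequest struct {
	LastLogTerm  uint64
	LastLogIndex uint64
}

// shouldGrantPreVote grants the probe only if this node has not heard
// from a leader within its election timeout and the candidate's log is
// at least as up to date as its own (the same comparison Raft uses for
// real votes). A node that is merely partitioned away from a healthy
// leader therefore cannot force a disruptive election.
func (n *node) shouldGrantPreVote(req PreVoteRequest) bool {
	if time.Since(n.lastLeaderContact) < n.electionTimeout {
		return false
	}
	if req.LastLogTerm != n.lastLogTerm {
		return req.LastLogTerm > n.lastLogTerm
	}
	return req.LastLogIndex >= n.lastLogIndex
}
```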
💡 Key Takeaways
Log divergence from incorrect term and index handling or fsync failures can cause catastrophic safety violations where different nodes commit conflicting values at the same position
Garbage collection pauses approaching the election timeout trigger spurious elections and can create election storms with repeated split votes; a common guideline is to keep pauses below one tenth of the timeout, roughly 15 to 30 milliseconds for 150 to 300 millisecond timeouts
Split brain during reconfiguration occurs when removing nodes before adding replacements creates two disjoint majorities; joint consensus requires majorities from both old and new configurations to prevent this
Slow minority members do not block commit but can impact throughput through backpressure, especially during snapshot transfers; production systems apply flow control and throttle snapshots to stragglers
Lease-based reads require trustworthy clocks with bounded skew; clock drift beyond the assumed skew bound can violate linearizability by letting a deposed leader keep serving stale reads because its own clock says the lease has not yet expired (see the sketch after this list)
Large log entries of multiple megabytes dramatically increase p99 commit latency and can trigger follower timeouts; production systems either reject large entries or implement chunking with back references
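To illustrate the lease point above, the following sketch (hypothetical names; the clock-skew bound is an assumed configuration value) shows a leader refusing to serve a local read unless its lease, shrunk by the assumed skew bound, is still valid:

```go
package raft

import "time"

// leaderLease tracks when the current leadership lease was granted and
// for how long; the struct and field names are hypothetical.
type leaderLease struct {
	grantedAt    time.Time
	duration     time.Duration
	maxClockSkew time.Duration // assumed upper bound on clock drift between nodes
}

// canServeLocalRead reports whether the leader may answer a read from
// local state without a quorum round trip. The lease is shrunk by the
// assumed skew bound so that, even if this node's clock runs fast
// relative to its peers, it stops serving reads before a new leader
// could have been elected elsewhere.
func (l leaderLease) canServeLocalRead(now time.Time) bool {
	safeExpiry := l.grantedAt.Add(l.duration - l.maxClockSkew)
	return now.Before(safeExpiry)
}
```

If real clock drift exceeds the assumed bound, this check no longer guarantees that a new leader has not already been elected, which is why bounded skew is a hard requirement for lease-based reads.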
📌 Examples
A production Kubernetes cluster experienced election storms when a node's SSD began experiencing intermittent 200 millisecond fsync stalls. This exceeded the 150 millisecond election timeout, causing followers to repeatedly start elections. The cluster spent 40% of its time in elections, degrading write availability. Monitoring fsync p99 latency and alerting on values exceeding 50 milliseconds would have caught the failing disk earlier.
An operator performed an unsafe reconfiguration on a 5-node Consul cluster by removing 2 nodes for hardware replacement, leaving 3 nodes. Before the new nodes had fully joined, a network partition isolated 2 of the remaining 3 nodes together with the 2 new, partially initialized nodes. That group of 2 old plus 2 new nodes briefly believed it held quorum, creating a split brain with the isolated third node's view of the cluster. Later procedures avoided this by using joint consensus and removing only one node at a time.
Google Chubby uses conservative timeouts and lease mechanisms to avoid split brain. Clients cache locks with leases, and master failover requires both a new election and waiting for all old leases to expire, often taking several seconds. This deliberate delay prevents two masters from granting conflicting locks to different clients during a partition.