Failure Modes and Edge Cases in Leader Election
Leader election systems face numerous subtle failure modes that can violate safety or liveness guarantees if not properly handled. Log divergence during unsafe leader promotion is particularly dangerous: if a follower missing committed entries is promoted to leadership, those entries are lost forever. For example, in a 3-node Raft cluster, suppose the leader replicates entry X to one follower, commits it (two of three nodes is a majority), and then crashes before the third follower receives X. An election restricted only by term numbers could elect the third follower (which lacks X) as the new leader, causing data loss. The mitigation is the election restriction: a voter grants its vote only to a candidate whose log is at least as up-to-date as its own, which Raft enforces by comparing last log term and index during voting; this guarantees the winner's log is at least as up-to-date as a majority of the cluster.
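As a minimal sketch of that voting rule (the types and function names here are illustrative, not taken from any particular implementation), a voter compares the candidate's last log term, then last log index:

```go
package main

import "fmt"

// LogMeta holds the term and index of a node's last log entry.
type LogMeta struct {
	LastTerm  uint64
	LastIndex uint64
}

// atLeastAsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: compare last terms first, then last
// indexes on a tie, as Raft's election restriction prescribes.
func atLeastAsUpToDate(candidate, voter LogMeta) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

// grantVote applies the restriction inside RequestVote handling:
// a voter refuses any candidate whose log is behind its own.
func grantVote(candidate, voter LogMeta) bool {
	return atLeastAsUpToDate(candidate, voter)
}

func main() {
	// The 3-node scenario from the text: the crashed leader committed
	// entry X at term 2, index 5; one follower has it, one does not.
	followerWithX := LogMeta{LastTerm: 2, LastIndex: 5}
	staleFollower := LogMeta{LastTerm: 2, LastIndex: 4}

	// The stale follower cannot win: the up-to-date follower refuses it.
	fmt.Println(grantVote(staleFollower, followerWithX)) // false
	// The up-to-date follower can win, preserving entry X.
	fmt.Println(grantVote(followerWithX, staleFollower)) // true
}
```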
Thundering herd and election storms occur when multiple candidates start elections simultaneously, causing repeated split votes that prevent convergence. Without randomization, nodes with similar failure-detection timeouts all become candidates at once, each voting for itself and failing to achieve a majority. This can persist for tens of seconds or longer, causing extended unavailability. The solution is randomized election timeouts: Raft, for example, recommends choosing timeouts uniformly at random from a range such as 150 to 300 ms, with exponential backoff and jitter after failed elections. Empirically, a range whose upper bound is about twice its lower bound (such as 150 to 300 ms) provides enough spread to avoid storms while keeping failover time bounded.
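A sketch of such timeout selection, assuming a 150 ms base and a hypothetical backoff policy (the doubling and the 5 s cap are illustrative choices, not part of the Raft paper):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a randomized timeout for the given retry
// attempt. Attempt 0 yields a value in [150ms, 300ms); each failed
// election doubles the jitter window, capped at 5 seconds.
func electionTimeout(attempt int) time.Duration {
	base := 150 * time.Millisecond
	window := base << attempt // 150ms, 300ms, 600ms, ... (attempt assumed small)
	const maxWindow = 5 * time.Second
	if window > maxWindow {
		window = maxWindow
	}
	// Full jitter: uniform in [base, base+window).
	return base + time.Duration(rand.Int63n(int64(window)))
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("attempt %d: timeout %v\n", attempt, electionTimeout(attempt))
	}
}
```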
Dependency failures create subtle cascading issues. If an external coordinator like ZooKeeper becomes unavailable or experiences high tail latencies, leadership changes stall or are delayed even when the current leader has clearly failed. For example, if ZooKeeper quorum latencies spike to seconds during a deployment or network event, applications relying on ZooKeeper for leader election are stuck with a failed leader until ZooKeeper recovers, extending outages. The mitigation is to treat the coordinator as critical infrastructure: run redundant, well-provisioned ensembles, monitor p99 latencies closely, isolate traffic classes to prevent noisy neighbors, and consider in-protocol election to eliminate the external dependency.
Membership changes during election present another edge case: changing the voter set (adding or removing nodes) while an election is in progress can deadlock or break quorum safety. The fix is two-phase membership changes (joint consensus), where the system briefly requires majorities in both the old and new configurations.
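The joint-consensus rule reduces to a quorum check that consults both configurations; this is a minimal sketch with hypothetical names:

```go
package main

import "fmt"

// hasQuorum reports whether the recorded votes form a majority of the
// given voter set.
func hasQuorum(votes map[string]bool, voters []string) bool {
	count := 0
	for _, v := range voters {
		if votes[v] {
			count++
		}
	}
	return count > len(voters)/2
}

// jointQuorum implements the joint-consensus rule: during a membership
// change, a decision (an election win or a commit) needs majorities in
// BOTH the old and the new configuration.
func jointQuorum(votes map[string]bool, oldVoters, newVoters []string) bool {
	return hasQuorum(votes, oldVoters) && hasQuorum(votes, newVoters)
}

func main() {
	oldCfg := []string{"a", "b", "c"}           // old voter set
	newCfg := []string{"a", "b", "c", "d", "e"} // new voter set

	// 2 of 3 old voters but only 2 of 5 new voters: not enough.
	votes := map[string]bool{"a": true, "b": true}
	fmt.Println(jointQuorum(votes, oldCfg, newCfg)) // false

	// Majorities in both configurations: the decision is safe.
	votes["d"] = true
	fmt.Println(jointQuorum(votes, oldCfg, newCfg)) // true
}
```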
💡 Key Takeaways
• Log divergence and unsafe promotion cause data loss if a follower missing committed entries becomes leader. Mitigation requires the election restriction: Raft candidates must have logs at least as up-to-date as a majority, rejecting stale candidates to preserve committed entries
• Thundering herd and election storms from simultaneous candidacy cause repeated split votes, preventing convergence for tens of seconds. Randomized election timeouts (for example, a 150 to 300 ms range) with exponential backoff and jitter resolve storms by spreading candidate start times
• External coordinator failures (ZooKeeper unavailable or high-latency) prevent leadership changes even when the current leader has clearly failed, extending outages. Mitigation requires redundant, well-provisioned coordinator ensembles and p99 latency monitoring, or eliminating the dependency via in-protocol election
• Membership changes during election can deadlock or break quorum safety if nodes cannot agree on the voter set. Two-phase membership changes (joint consensus) require majorities in both old and new configurations simultaneously during the transition, preventing split quorums
• Slow or asymmetric networks with one-way packet loss cause false suspicions and unnecessary elections. Accrual failure detectors (phi accrual) adapt thresholds based on observed latency distributions, reducing false positives compared to fixed-timeout detectors; see the sketch below
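A minimal phi accrual sketch, assuming heartbeat inter-arrival times are normally distributed (real detectors, as described by Hayashibara et al., are more elaborate; all names here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// PhiDetector models heartbeat inter-arrival times as a normal
// distribution and reports a suspicion level phi instead of a binary
// alive/dead verdict.
type PhiDetector struct {
	intervals []float64 // observed inter-arrival times, in ms
	lastBeat  float64   // timestamp of the last heartbeat, in ms
}

// Heartbeat records an arrival and updates the interval history.
func (d *PhiDetector) Heartbeat(now float64) {
	if d.lastBeat != 0 {
		d.intervals = append(d.intervals, now-d.lastBeat)
	}
	d.lastBeat = now
}

// Phi returns the suspicion level at time now:
// phi = -log10(P(the next heartbeat arrives after this much silence)).
func (d *PhiDetector) Phi(now float64) float64 {
	mean, std := stats(d.intervals)
	elapsed := now - d.lastBeat
	// P(interval > elapsed) under the fitted normal distribution.
	p := 1 - normalCDF(elapsed, mean, std)
	if p < 1e-12 {
		p = 1e-12 // avoid log(0)
	}
	return -math.Log10(p)
}

func stats(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	std = math.Sqrt(std / float64(len(xs)))
	if std < 10 {
		std = 10 // floor the stddev so perfectly regular samples don't explode phi
	}
	return mean, std
}

func normalCDF(x, mean, std float64) float64 {
	return 0.5 * (1 + math.Erf((x-mean)/(std*math.Sqrt2)))
}

func main() {
	d := &PhiDetector{}
	// Heartbeats every 100 ms for one second.
	for t := 0.0; t <= 1000; t += 100 {
		d.Heartbeat(t)
	}
	// Suspicion grows smoothly with silence rather than flipping at a
	// fixed timeout; the alerting threshold is a tunable (8 is common).
	fmt.Printf("phi after 120 ms silence: %.2f\n", d.Phi(1120))
	fmt.Printf("phi after 500 ms silence: %.2f\n", d.Phi(1500))
}
```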
📌 Examples
A 5-node Raft cluster with a 200 ms election timeout where all nodes detect leader failure within 10 ms of each other: without randomization, all 5 become candidates simultaneously, each votes for itself, no majority forms, and the split vote repeats every 200 ms. With randomized 150 to 300 ms timeouts, nodes stagger candidacy and one achieves a majority within 200 to 400 ms
HDFS NameNode HA during a ZooKeeper deployment that spikes latencies to 5 seconds: the active NameNode crashes, but the standby cannot acquire the fencing lock because of ZooKeeper slowness, extending NameNode unavailability past 5 seconds despite a healthy standby
A Kafka cluster attempting to remove a broker while a controller election is in progress: the old controller believes the configuration includes the old broker set, the new candidate sees the new configuration, and neither can form a majority in a single configuration, causing deadlock until timeout and retry with a stable configuration