Leader Election

What is Leader Election and Why It Matters

Leader election is a distributed systems primitive that ensures exactly one node acts as the coordinator for a group of replicas or a partition at any given time. The leader serializes writes, assigns work, and serves as the single authority for ordering and conflict resolution. This simplifies correctness by creating a clear decision point: clients route writes to the leader, followers replicate and apply the leader's decisions, and the system uses majority quorums to guarantee safety despite node failures.

Two ingredients appear in virtually every production leader election system: majority quorums and time-based mechanisms. Quorums prevent split-brain scenarios by requiring a majority (for example, 3 out of 5 nodes) to elect or recognize a leader. Time-based mechanisms such as heartbeats, leases, and randomized timeouts detect failures and prevent overlapping leadership. Correct systems pair a monotonically increasing epoch or term number, incremented at each election, with fencing tokens or leases so that an old leader cannot keep acting if it returns after a pause or network partition.

Leader election is only one part of a larger control loop. You also need failure detection through heartbeats or accrual detectors, agreement on a unique leader via protocols like Raft or Paxos, safe ownership transfer using fencing or leases, and convergence, where followers catch up before serving reads or writes that require freshness.

Tuning timeouts trades availability against safety: shorter timeouts reduce failover time but increase false positives under slow networks or garbage collection (GC) pauses, while longer timeouts reduce false elections but extend unavailability during real failures.
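The fencing half of that guarantee is the subtle part: an old leader that wakes up after a long GC pause or a partition may still believe it is in charge. The Go sketch below shows the core idea; the FencedStore type and its behavior are illustrative assumptions, not the API of any particular system. The shared resource remembers the highest epoch (fencing token) it has accepted and rejects writes that carry an older one.

```go
// Sketch: a shared resource that enforces fencing tokens.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStaleLeader = errors.New("write rejected: stale fencing token")

// FencedStore is a hypothetical shared resource guarded by fencing tokens.
type FencedStore struct {
	mu           sync.Mutex
	highestEpoch uint64
	data         map[string]string
}

func NewFencedStore() *FencedStore {
	return &FencedStore{data: make(map[string]string)}
}

// Write applies a mutation only if the caller's epoch (its fencing token) is
// at least as new as the highest epoch the store has already accepted.
func (s *FencedStore) Write(epoch uint64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.highestEpoch {
		return ErrStaleLeader // an old leader returning after a pause or partition
	}
	s.highestEpoch = epoch
	s.data[key] = value
	return nil
}

func main() {
	store := NewFencedStore()
	fmt.Println(store.Write(2, "config", "v2")) // current leader at epoch 2: accepted
	fmt.Println(store.Write(1, "config", "v1")) // stale leader at epoch 1: rejected
}
```

The check has to live on the resource side: the old leader cannot be trusted to detect its own staleness, so the store (or lock service) compares tokens on every write.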
💡 Key Takeaways
Leader election guarantees exactly one coordinator node at a time, simplifying write ordering and conflict resolution by creating a single authority
Majority quorums prevent split brain by requiring more than half the nodes (for example, 3 of 5) to agree on leadership
Monotonically increasing epoch or term numbers combined with fencing tokens prevent old leaders from corrupting state after network partitions or long pauses
Heartbeats detect failures, typically sent every 50 to 100 milliseconds (ms) in Local Area Network (LAN) environments with election timeouts 5 to 10 times longer
Timeout tuning trades availability for safety: shorter timeouts (200 to 300 ms) enable faster failover but risk false elections during transient slowness, longer timeouts (6 to 20 seconds) reduce false positives but extend unavailability windows
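As a rough illustration of how heartbeats and randomized election timeouts fit together, the Go sketch below resets a follower's election timer on every heartbeat and starts a candidacy only when the timer expires first. All constants, names, and the 5 to 10x multiplier are assumptions for illustration, not the configuration of any specific system.

```go
// Sketch: a follower that treats a missed election timeout as leader failure.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const heartbeatInterval = 100 * time.Millisecond // assumed LAN-style setting

// randomElectionTimeout picks a timeout between 5x and 10x the heartbeat
// interval, with jitter so followers do not all time out at the same moment.
func randomElectionTimeout() time.Duration {
	base := 5 * heartbeatInterval
	jitter := time.Duration(rand.Int63n(int64(5 * heartbeatInterval)))
	return base + jitter
}

// runFollower waits for heartbeats; if the election timer fires before the
// next heartbeat arrives, the follower becomes a candidate.
func runFollower(heartbeats <-chan struct{}) {
	timer := time.NewTimer(randomElectionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-heartbeats:
			// Leader is alive: reset the election timer.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(randomElectionTimeout())
		case <-timer.C:
			fmt.Println("election timeout elapsed: becoming candidate")
			return // a real protocol would increment its term and request votes here
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// Simulate a leader that sends three heartbeats, then fails.
		for i := 0; i < 3; i++ {
			time.Sleep(heartbeatInterval)
			hb <- struct{}{}
		}
	}()
	runFollower(hb)
}
```

The jitter is what keeps multiple followers from timing out simultaneously and splitting the vote, which is why Raft-style protocols randomize the election timeout per node.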
📌 Examples
Google Spanner uses Paxos leader leases of approximately 10 seconds per replica group, achieving write latencies of tens of milliseconds within a region and completing leader changes in a few seconds
Apache Kafka with KRaft mode (Raft based) uses election timeouts around 200 to 1,000 ms, achieving controller failovers in sub second to a few seconds compared to the several second failovers with ZooKeeper based controller election
Kubernetes etcd commonly runs with 100 ms heartbeat intervals and 1,000 ms election timeouts in LAN, resulting in leader failover typically completing in 1 to 3 seconds with 3 to 5 nodes
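For etcd specifically, those values map to its --heartbeat-interval and --election-timeout flags, both expressed in milliseconds. Using the numbers quoted above, an example invocation (illustrative only; other flags omitted) might look like:

```
etcd --heartbeat-interval=100 --election-timeout=1000
```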