Preventing Split Brain: Quorums, Fencing, and Epochs
Split brain occurs when network partitions or failures cause two subclusters to each believe they have a valid leader, resulting in divergent histories that violate consistency guarantees. Preventing split brain requires multiple coordinated mechanisms working together. The foundation is majority quorums: requiring more than half the nodes to agree on leadership means at most one partition can form a majority. For example, in a 5-node cluster you need 3 votes to elect a leader, so two separate partitions (say, 2 and 3 nodes) cannot both elect leaders simultaneously.
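To make the arithmetic concrete, here is a minimal sketch in Go. The helper names are illustrative and not tied to any particular consensus implementation; they only express the majority rule and the partition check.

```go
package main

import "fmt"

// quorum returns the minimum number of votes that constitutes a
// majority in a cluster of n nodes: more than half, i.e. n/2 + 1.
func quorum(n int) int {
	return n/2 + 1
}

// canElect reports whether a partition of partitionSize nodes out of an
// n-node cluster can gather a majority and therefore elect a leader.
func canElect(partitionSize, n int) bool {
	return partitionSize >= quorum(n)
}

func main() {
	const n = 5
	fmt.Println(quorum(n))      // 3 votes needed
	fmt.Println(canElect(2, n)) // false: the 2-node partition cannot elect
	fmt.Println(canElect(3, n)) // true: only the 3-node partition can
}
```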
However, quorums alone are insufficient because of timing edge cases. A leader may experience a long garbage collection pause or transient network partition, lose its lease, and then resume serving traffic believing it is still the leader. The second critical ingredient is fencing tokens or leader epochs combined with strict validation: every state-changing operation must include a monotonically increasing epoch number that is persisted and validated by followers and storage systems, and downstream systems must reject operations with stale epochs. For example, if a paused leader with epoch 5 wakes up and attempts writes after a new leader with epoch 6 has been elected, followers and storage nodes reject the epoch-5 operations, preventing corruption.
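The validation rule downstream systems apply is simple to state: remember the highest epoch ever accepted and refuse anything older. A minimal sketch follows; the StorageNode and Write names are hypothetical and not the API of any specific system.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStaleEpoch = errors.New("rejected: operation carries a stale epoch")

// StorageNode tracks the highest epoch it has accepted and rejects any
// state-changing operation tagged with a lower epoch.
type StorageNode struct {
	mu           sync.Mutex
	highestEpoch uint64 // must be durably persisted in a real system
	data         map[string]string
}

func NewStorageNode() *StorageNode {
	return &StorageNode{data: make(map[string]string)}
}

// Write applies the operation only if its epoch is at least as new as
// the highest epoch this node has ever seen.
func (s *StorageNode) Write(epoch uint64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.highestEpoch {
		return ErrStaleEpoch
	}
	s.highestEpoch = epoch // remember the newer epoch before applying
	s.data[key] = value
	return nil
}

func main() {
	node := NewStorageNode()
	fmt.Println(node.Write(6, "config", "v2")) // new leader, epoch 6: accepted (nil)
	fmt.Println(node.Write(5, "config", "v1")) // paused old leader, epoch 5: ErrStaleEpoch
}
```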
Production systems layer multiple safeguards. Google Chubby and Spanner use Paxos leader leases validated by quorums: the leader must successfully renew its lease with a majority before the expiration time, and operations are tagged with lease sequence numbers that storage nodes validate. Apache HDFS NameNode High Availability (HA) uses ZooKeeper-based fencing locks combined with shared-storage fencing: when a new NameNode becomes active, it must acquire the lock and fence the previous NameNode's access to the shared edit logs, preventing the old NameNode from continuing writes even if it remains running. The failure mode this guards against is particularly insidious: without fencing, a paused leader can wake up, pass all of its local checks (believing its lease is still valid due to clock skew), and corrupt state with operations that conflict with the new leader's decisions.
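The general lease pattern these systems share can be sketched as below. The Leader, renewLease, and mayServe names are assumptions for illustration, not Chubby's or Spanner's actual interfaces; a production implementation would also persist its state durably and use real RPCs rather than in-process callbacks.

```go
package main

import (
	"fmt"
	"time"
)

// Leader holds a time-bounded lease that must be re-confirmed by a
// majority before it keeps acting as leader. The per-peer renew
// function stands in for a hypothetical renewal RPC.
type Leader struct {
	peers       []func(epoch uint64, extend time.Duration) bool
	epoch       uint64
	leaseExpiry time.Time
}

// renewLease computes the proposed expiry *before* contacting peers
// (the conservative choice) and adopts it only if a majority of the
// cluster, counting the leader itself, acknowledges the renewal.
func (l *Leader) renewLease(extend time.Duration) bool {
	proposed := time.Now().Add(extend)
	acks := 1 // the leader's own vote
	for _, renew := range l.peers {
		if renew(l.epoch, extend) {
			acks++
		}
	}
	clusterSize := len(l.peers) + 1
	if acks >= clusterSize/2+1 {
		l.leaseExpiry = proposed
		return true
	}
	return false // keep the old expiry; step down once it passes
}

// mayServe is checked before every operation: a leader that wakes from
// a long pause with an expired lease must refuse to serve.
func (l *Leader) mayServe() bool {
	return time.Now().Before(l.leaseExpiry)
}

func main() {
	ack := func(uint64, time.Duration) bool { return true }
	l := &Leader{epoch: 6, peers: []func(uint64, time.Duration) bool{ack, ack, ack, ack}}
	fmt.Println(l.renewLease(10 * time.Second)) // true: 5 of 5 acknowledge
	fmt.Println(l.mayServe())                   // true while the lease holds
}
```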
💡 Key Takeaways
• Majority quorums ensure at most one partition can elect a leader: in a 5-node cluster requiring 3 votes, two separate partitions cannot both achieve a majority, preventing simultaneous leadership
• Fencing tokens or epoch numbers must be monotonically increasing, persisted, and validated on every state-changing operation so followers and storage reject stale leaders resuming after pauses or partitions
• Clock skew with time-based leases creates risk: a leader with a slow clock may believe its lease is valid when others consider it expired, requiring quorum-based lease validation rather than trusting a single node's local clock
• Google Spanner combines Paxos leader leases (approximately 10 seconds) with TrueTime to bound clock uncertainty to a few milliseconds, ensuring safe lease management even with clock drift across distributed nodes (see the sketch after this list)
• HDFS NameNode HA uses ZooKeeper fencing locks plus shared-storage fencing to prevent a paused previous NameNode from writing to the edit logs after a new NameNode activates, protecting against split-brain corruption
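One way to see the clock-skew point from the takeaways above: a leader should trust its lease only if even the latest possible true time is still before the expiry. A minimal sketch of such a clock-uncertainty-aware check follows; the uncertainty constant and helper names are assumptions for illustration, not Spanner's TrueTime API.

```go
package main

import (
	"fmt"
	"time"
)

// uncertainty is an assumed worst-case clock error bound. TrueTime
// exposes a bound per reading; a constant is used here purely for
// illustration.
const uncertainty = 7 * time.Millisecond

// latestPossibleNow is the latest instant the true time could be,
// given the local reading plus the error bound.
func latestPossibleNow() time.Time {
	return time.Now().Add(uncertainty)
}

// leaseStillValid is deliberately conservative: the leader trusts its
// lease only if even the latest possible true time is before expiry,
// so a slow local clock cannot trick it into overstaying.
func leaseStillValid(leaseExpiry time.Time) bool {
	return latestPossibleNow().Before(leaseExpiry)
}

func main() {
	expiry := time.Now().Add(10 * time.Second) // e.g. a ~10s leader lease
	fmt.Println(leaseStillValid(expiry))       // true: well inside the lease
	fmt.Println(leaseStillValid(time.Now()))   // false: expiry falls inside the uncertainty window
}
```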
📌 Examples
A Kafka broker experiencing a 15-second garbage collection pause loses its ZooKeeper session (6-second timeout) and a new controller is elected, but the paused broker wakes up and attempts partition leadership operations. Without epoch validation, this causes inconsistent replica state; with epoch validation, followers reject the stale-epoch operations.
Amazon Aurora writer promotion requires quorum verification against the 6-way replicated storage layer across 3 AZs, ensuring a promoted writer cannot conflict with a paused previous writer that resumes after a network partition heals.
etcd (Raft) uses term numbers: a candidate increments the term when starting an election, and followers reject AppendEntries Remote Procedure Calls (RPCs) from leaders with stale terms, preventing a previously partitioned old leader from corrupting the log when it rejoins.
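A minimal sketch of that term check, modeled on the rule in the Raft paper ("reply false if term < currentTerm") rather than on etcd's actual code; the type and field names are illustrative.

```go
package main

import "fmt"

// AppendEntriesArgs carries the sender's term; an illustrative subset
// of the fields in the Raft paper's AppendEntries RPC.
type AppendEntriesArgs struct {
	Term     uint64
	LeaderID string
	Entries  []string
}

// Follower tracks the highest term it has seen.
type Follower struct {
	currentTerm uint64
	log         []string
}

// HandleAppendEntries rejects entries from a leader with a stale term,
// so an old, partitioned leader cannot overwrite the log after a newer
// leader has been elected.
func (f *Follower) HandleAppendEntries(args AppendEntriesArgs) bool {
	if args.Term < f.currentTerm {
		return false // stale leader: ignore its entries
	}
	f.currentTerm = args.Term // adopt the newer term
	f.log = append(f.log, args.Entries...)
	return true
}

func main() {
	f := &Follower{currentTerm: 7}
	ok := f.HandleAppendEntries(AppendEntriesArgs{Term: 6, LeaderID: "old-leader", Entries: []string{"x=1"}})
	fmt.Println(ok) // false: term 6 < currentTerm 7, the write is rejected
}
```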