
Failover, Split Brain, and Leader Election Mechanics

When the leader fails or becomes unreachable because of a network partition or a long garbage-collection (GC) pause, the system must elect a new leader to restore write availability. The danger is split brain: two nodes simultaneously believing they are leader, accepting divergent writes, and corrupting data. Preventing split brain requires strict coordination via consensus protocols like Raft or Paxos that use terms or epochs (monotonically increasing integers) and quorums. A candidate can only become leader by receiving votes from a majority of nodes in the current term, and any node that has already voted in that term will reject subsequent candidates. The old leader, when it recovers or the partition heals, sees the higher term and immediately steps down, refusing further writes. Systems without proper consensus may use fencing tokens or time-based leases, but these are vulnerable to clock skew and long pauses.

Failover latency has two components: detection time, and election plus catch-up time. Aggressive failure-detection timeouts (1 to 3 seconds) yield faster recovery but risk false positives from transient network hiccups or GC pauses, leading to unnecessary leadership thrashing. Conservative timeouts (10 to 30 seconds) reduce false failovers but extend write unavailability. Real-world deployments typically target 3 to 10 seconds for detection to balance these concerns. After detecting failure, the election itself may take another few seconds (Raft's default election timeout is often 150 to 300 milliseconds, but with network retries and multiple rounds it can stretch longer), and then clients must discover the new leader and reroute, adding several more seconds. Total observed failover times in production systems like MongoDB often land in the 10 to 30 second range including all phases.

A critical but often overlooked failure mode is the new leader being behind the old leader when elected, which can cause already acknowledged writes to disappear. Asynchronous replication allows this: the old leader acknowledged writes locally and then crashed before replicating them. The solution is either to use synchronous or majority commit, so the new leader (elected from the majority) is guaranteed to have all committed writes, or to accept potential data loss and implement conflict resolution when the old leader rejoins. Log divergence must be resolved by truncating the old leader's uncommitted tail or by using vector clocks and reconciliation, adding operational complexity. Google Spanner avoids this entirely by using Paxos groups where only a replica with all committed entries can become leader.
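To make the term-and-quorum rule concrete, here is a minimal Go sketch of the voter side of a Raft-style election, using illustrative type and field names rather than any real implementation. It shows the two properties the paragraph above relies on: a node grants at most one vote per term, and any message carrying a higher term forces a stale leader to step down. (Real Raft also compares log freshness before granting a vote, which is omitted here.)

```go
package electionsketch

import "sync"

// Node holds the minimal per-node election state needed for the sketch:
// the highest term seen, who (if anyone) received this node's vote in that
// term, and whether this node currently believes it is leader.
type Node struct {
	mu          sync.Mutex
	currentTerm uint64
	votedFor    string // "" means no vote cast yet in currentTerm
	isLeader    bool
}

// VoteRequest is what a candidate sends after its election timeout fires.
type VoteRequest struct {
	Term        uint64
	CandidateID string
}

// HandleVoteRequest grants at most one vote per term. Because a candidate
// needs votes from a majority, and no two candidates can both collect a
// majority of single-use votes, at most one leader can exist per term.
func (n *Node) HandleVoteRequest(req VoteRequest) (granted bool) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// A higher term forces us to adopt it; if we were leader, we step down.
	// This is how a recovered old leader learns that it is stale.
	if req.Term > n.currentTerm {
		n.currentTerm = req.Term
		n.votedFor = ""
		n.isLeader = false
	}

	// Reject candidates from an older term outright.
	if req.Term < n.currentTerm {
		return false
	}

	// Grant the vote only if we have not already voted for someone else
	// in this term.
	if n.votedFor == "" || n.votedFor == req.CandidateID {
		n.votedFor = req.CandidateID
		return true
	}
	return false
}
```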
💡 Key Takeaways
Split brain occurs when a network partition or long pauses cause two nodes to claim leadership simultaneously, resulting in divergent writes and corruption. It is prevented by consensus protocols with terms and quorum votes; Raft and Paxos guarantee only one leader per term.
Failover latency equals detection time (3 to 10 seconds typical) plus election (1 to 5 seconds) plus client discovery (2 to 5 seconds), totaling 6 to 20 seconds or more in production. MongoDB replica sets commonly see 10 to 30 second failover times end to end.
Aggressive timeouts (1 to 3 seconds) enable fast recovery but cause false positives from GC pauses or transient network blips. Conservative timeouts (10 to 30 seconds) reduce thrashing but extend write downtime. Tune based on workload characteristics and SLO.
With asynchronous replication, the new leader may lack writes the old leader acknowledged, causing data loss. Majority commit prevents this: a new leader elected from the majority is guaranteed to have all committed data. Spanner's Paxos groups ensure the elected leader has the complete log.
Fencing via terms and quorum writes prevents the old leader from corrupting state after the partition heals. The old leader must check that its term is current and that it can reach a quorum before accepting writes (see the commit-path sketch after this list). Lease-based fencing is vulnerable to clock skew and long pauses.
Follower catch-up after election can delay write availability if the new leader requires all followers to reach a minimum LSN before serving. Cascading replication (followers replicating from other followers) offloads the leader but increases lag and complexity.
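Continuing the illustrative Node sketch from above, the commit path below shows the fencing discipline from the fencing takeaway: a write is acknowledged only once a majority accepts it in the leader's current term, and the leader steps down the moment any reply reveals a higher term. The appendFn signature is an assumption standing in for a real replication RPC.

```go
// AppendResult is the illustrative reply a follower returns when the leader
// replicates an entry; it carries the follower's current term so a deposed
// leader discovers that it is stale.
type AppendResult struct {
	Term    uint64
	Success bool
}

// appendFn abstracts the replication RPC to a single follower.
type appendFn func(term uint64, entry []byte) AppendResult

// TryCommit acknowledges a write only after a strict majority of the replica
// set (counting the leader's own copy) has accepted it in the leader's
// current term. A leader isolated by a partition can never assemble this
// quorum, so it cannot acknowledge writes that a newly elected leader on the
// majority side would be missing.
func (n *Node) TryCommit(entry []byte, followers []appendFn) bool {
	n.mu.Lock()
	term, leading := n.currentTerm, n.isLeader
	n.mu.Unlock()
	if !leading {
		return false
	}

	acks := 1 // the leader's own durable copy counts toward the quorum
	for _, send := range followers {
		res := send(term, entry)
		if res.Term > term {
			// A follower has seen a newer term: fence ourselves off by
			// stepping down and refusing to acknowledge the write.
			n.mu.Lock()
			n.currentTerm, n.isLeader = res.Term, false
			n.mu.Unlock()
			return false
		}
		if res.Success {
			acks++
		}
	}
	return acks > (len(followers)+1)/2 // majority of leader + followers
}
```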
📌 Examples
Raft leader election: Followers start an election timeout randomized between 150 and 300 ms. On timeout, a follower increments its term, votes for itself, and requests votes from peers. If it receives a majority of votes, it becomes leader and starts sending heartbeats. If the old leader receives a message with a higher term, it immediately steps down. In practice, with a 5-node cluster on a LAN, the election typically completes in under 1 second, but total failover including detection and client retry is 5 to 15 seconds.
MongoDB replica set split brain prevention: Nodes use the replica set configuration version and an election protocol. The primary steps down if it cannot reach a majority of voting members within 10 seconds (the default). During a partition, the side with the majority elects a new primary; the minority side has no primary and rejects writes. When the partition heals, the old primary sees the higher term from the new primary and steps down. The rollback process then truncates uncommitted oplog entries on the old primary.
Kafka controller election via ZooKeeper: Brokers watch the ephemeral controller node in ZooKeeper. When the controller crashes, its ZooKeeper session expires and the ephemeral node is deleted. Brokers race to create a new ephemeral node with their broker ID; the first to succeed becomes controller. The ZooKeeper session timeout is typically 6 to 18 seconds; total controller failover is often 10 to 20 seconds including partition leader re-election.
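The create-and-watch race behind this style of controller election can be sketched as follows. The zkClient interface is a stand-in with only the two operations the pattern needs; all names here are assumptions for illustration, not the API of any real client library.

```go
package controllersketch

import "errors"

// ErrNodeExists mirrors the error a client returns when the ephemeral node
// is already held by another live session.
var ErrNodeExists = errors.New("node already exists")

// zkClient is a minimal stand-in for a coordination-service client.
type zkClient interface {
	// CreateEphemeral creates a node tied to this client's session; the node
	// is deleted automatically when the session expires.
	CreateEphemeral(path string, data []byte) error
	// WatchDelete fires once when the node at path disappears. A real client
	// would register the watch atomically with an existence check to avoid
	// missing a deletion that happens in between.
	WatchDelete(path string) <-chan struct{}
}

// runControllerElection loops: try to become controller by creating the
// ephemeral node; if another broker already holds it, wait for the node to
// vanish (its session expired, i.e. the controller died) and race again.
func runControllerElection(zk zkClient, brokerID string, becomeController func()) {
	for {
		err := zk.CreateEphemeral("/controller", []byte(brokerID))
		if err == nil {
			becomeController() // won the race; act as controller until session loss
			return
		}
		if errors.Is(err, ErrNodeExists) {
			<-zk.WatchDelete("/controller") // lost the race; wait and retry
			continue
		}
		// Other errors (connection loss, session expiry) would be retried
		// with backoff in a real implementation; omitted here.
		return
	}
}
```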