Leader-Follower Failure Modes and Operational Edge Cases
Leader-follower replication has numerous failure modes that can cause data loss, corruption, or extended downtime if not properly handled. Follower lag and backpressure issues arise when followers cannot keep up with the leader's write rate due to slower hardware, high query load, or network bandwidth limits. As the lag grows, the leader's memory and disk consumption for unsent or unacknowledged log entries increases, potentially causing out-of-memory errors or disk space exhaustion. If lag exceeds the log retention window (often configured as time-based, like 7 days, or size-based, like 100 GB), the lagging follower can no longer catch up incrementally and must perform a full snapshot bootstrap, which can take hours for terabyte-scale databases and saturates the network and disks during the transfer.
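As a rough illustration of the lag-versus-retention tradeoff, the sketch below classifies a follower's lag against a size-based retention window so operators are paged well before a full snapshot bootstrap becomes unavoidable. The thresholds and the follower_id/lag inputs are illustrative, not tied to any particular database.

```python
# Minimal lag-monitor sketch (thresholds are illustrative). Alert well before
# follower lag approaches the log retention limit, since a follower that falls
# past retention must be rebuilt from a full snapshot.

RETENTION_BYTES = 100 * 1024**3        # size-based retention window (100 GB)
WARN_FRACTION = 0.5                    # warn at 50% of retention
CRITICAL_FRACTION = 0.8                # escalate at 80% of retention

def check_follower_lag(follower_id: str, lag_bytes: int) -> str:
    """Classify a follower's replication lag relative to the retention window."""
    if lag_bytes >= RETENTION_BYTES:
        return f"{follower_id}: lag exceeds retention; full snapshot bootstrap required"
    if lag_bytes >= CRITICAL_FRACTION * RETENTION_BYTES:
        return f"{follower_id}: CRITICAL, {lag_bytes / RETENTION_BYTES:.0%} of retention consumed"
    if lag_bytes >= WARN_FRACTION * RETENTION_BYTES:
        return f"{follower_id}: WARNING, lag growing toward retention limit"
    return f"{follower_id}: healthy"

print(check_follower_lag("follower-2", 85 * 1024**3))
```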
Catch-up storms occur after a widespread outage, when multiple followers simultaneously attempt to catch up from the leader. The aggregate bandwidth and disk Input/Output Operations Per Second (IOPS) can overwhelm the leader and shared infrastructure, causing severe latency degradation for foreground writes and reads. For example, if 5 followers each try to replicate at 500 MB per second, the leader must serve 2.5 GB per second of outbound replication traffic on top of normal client traffic. Mitigations include throttling catch-up rates, staggering recovery (bringing followers back one or two at a time), and using cascading replication, where some followers replicate from other followers rather than from the leader. Cascading replication reduces leader load but increases overall lag, since changes flow through multiple hops.
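A minimal sketch of the staggered, throttled recovery idea, assuming a hypothetical resync_follower(follower, rate_limit_mbps) call that starts a catch-up stream and returns a handle with a blocking wait(); the bandwidth numbers are illustrative.

```python
# Sketch of staggered, throttled catch-up scheduling. The goal is to keep
# aggregate replication bandwidth below a budget so recovery does not starve
# foreground traffic.

LEADER_BANDWIDTH_BUDGET_MBPS = 1000   # cap total replication traffic at ~1 GB/s
PER_FOLLOWER_CAP_MBPS = 250           # throttle each individual catch-up stream
MAX_CONCURRENT_CATCHUPS = 2           # bring followers back one or two at a time

def staggered_recovery(followers, resync_follower):
    """Resync followers in small batches instead of all at once.

    resync_follower(follower, rate_limit_mbps) is assumed (hypothetically) to
    start a rate-limited catch-up stream and return a handle whose wait()
    blocks until that follower is back in sync.
    """
    for i in range(0, len(followers), MAX_CONCURRENT_CATCHUPS):
        batch = followers[i:i + MAX_CONCURRENT_CATCHUPS]
        per_stream = min(PER_FOLLOWER_CAP_MBPS,
                         LEADER_BANDWIDTH_BUDGET_MBPS // len(batch))
        handles = [resync_follower(f, rate_limit_mbps=per_stream) for f in batch]
        for handle in handles:
            handle.wait()   # do not admit the next batch until this one is caught up
```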
Data loss on asynchronous failover is a subtle but critical edge case. The old leader may have acknowledged writes to clients that were never replicated before it crashed. When a follower is promoted to leader, those writes are gone. The problem compounds if the old leader recovers and rejoins: it has writes the new leader does not have, creating divergent histories. The usual remedy is either to truncate the old leader's uncommitted tail (discarding the acknowledged but unreplicated writes) or to implement conflict resolution and merge the divergent histories. This is why production systems requiring zero data loss must use synchronous or majority commit despite the latency cost. Systems using asynchronous replication must carefully set expectations with clients about durability: an acknowledgment means the write is durable on the leader only, not necessarily replicated or safe from total loss.
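One possible shape of the rejoin-and-truncate step, written against hypothetical local_log and new_leader interfaces rather than any specific database's API; the point is that the divergent tail is discarded and the loss is surfaced explicitly instead of silently.

```python
# Illustrative rejoin check for a recovered old leader (all interfaces are
# hypothetical). Any log entries past the last position the new leader also
# has were acknowledged locally but never replicated, and must be discarded
# before the node can rejoin as a follower.

def log_data_loss_event(entries):
    # In a real system this would emit metrics and page the on-call.
    print(f"data loss on failover: {len(entries)} acknowledged writes discarded")

def rejoin_as_follower(local_log, new_leader):
    """Truncate any divergent tail, then resume replication from the new leader."""
    common_point = new_leader.last_common_position(local_log)
    if local_log.end_position() > common_point:
        discarded = local_log.truncate_after(common_point)
        # These writes were acknowledged to clients under async replication but
        # are now lost; surface that loudly rather than dropping them quietly.
        log_data_loss_event(discarded)
    local_log.start_replication_from(new_leader, start_at=common_point)
```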
💡 Key Takeaways
•Follower lag beyond the log retention window forces an expensive full snapshot bootstrap. PostgreSQL Write-Ahead Log (WAL) retention is typically 16 to 64 GB or time-based; a follower further behind than that must copy the full database (hours at terabyte scale), blocking normal replication and consuming network and disk.
•Catch-up storms when multiple followers recover simultaneously can saturate the leader's network and disk, degrading foreground latency by 10 times or more. Example: 5 followers catching up at 500 MB per second each create 2.5 GB per second of outbound load on the leader. Mitigate with throttling and staggered recovery.
•Hot leader or shard from skewed access patterns causes high latency and potential overload. A single partition receiving 10 times the median traffic can drive p99 latency from 5 ms to 50 ms or more, affecting the entire application. Solutions: repartition with better key distribution, move the hot partition to a larger instance, or split the partition.
•Asynchronous failover data loss: the old leader acknowledged writes locally, crashed before replicating them, and a new leader was promoted without those writes. Clients believe the writes succeeded, but they are gone. MongoDB with writeConcern: 1 (the default before 5.0) could lose seconds of writes on failover. A majority write concern eliminates this; a write-concern sketch follows this list.
•Read-your-writes violations with async replicas: a client writes to the leader, immediately reads from a follower, and does not see the update. User experience is broken. Amazon RDS read replicas can lag by seconds under load; applications must use LSN gating (sketched after this list) or session stickiness to the leader for consistency.
•Large transactions or bulk operations block follower apply and lengthen failover times. A single 10 GB transaction can take minutes to apply on a follower, during which replication stalls and failover cannot promote that follower. Solution: chunk large operations and enforce transaction size limits (for example, 100 MB maximum); a chunked-delete sketch follows this list.
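For the MongoDB takeaway above, a small PyMongo sketch (connection string, database, and collection names are placeholders) that requests majority acknowledgment so a failover cannot silently discard acknowledged writes.

```python
# Require majority acknowledgment for writes to this collection.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"].get_collection(
    "orders", write_concern=WriteConcern(w="majority", wtimeout=5000)
)
# Acknowledged only once a majority of replica-set members have the write.
orders.insert_one({"order_id": 123, "status": "created"})
```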
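For the read-your-writes takeaway, a hedged LSN-gating sketch for PostgreSQL streaming replication using psycopg2 (hosts, table, and timeout are placeholders): after a write on the primary, the replica is only read once it has replayed past the write's LSN; otherwise the read falls back to the primary.

```python
import time
import psycopg2

primary = psycopg2.connect("dbname=app host=primary.example.internal")
replica = psycopg2.connect("dbname=app host=replica.example.internal")
replica.autocommit = True   # simple polling reads; no transaction needed

def write_and_get_lsn(sql, params):
    """Run a write on the primary, then return an LSN that covers its commit."""
    with primary, primary.cursor() as cur:
        cur.execute(sql, params)
    # Capture the LSN after commit so it includes this write's commit record.
    with primary, primary.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]

def replica_caught_up(target_lsn, timeout_s=5.0):
    """Poll until the replica has replayed past target_lsn, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with replica.cursor() as cur:
            cur.execute(
                "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
                (target_lsn,),
            )
            if cur.fetchone()[0]:
                return True
        time.sleep(0.05)
    return False

lsn = write_and_get_lsn("UPDATE profiles SET name = %s WHERE id = %s", ("Ada", 42))
if not replica_caught_up(lsn):
    pass  # read from the primary instead to preserve read-your-writes
```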
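For the large-transaction takeaway, an illustrative chunked bulk delete in Python/psycopg2 (table, column, and retention period are hypothetical); each batch commits in its own short transaction, so follower apply never stalls for long.

```python
import psycopg2

BATCH_SIZE = 10_000   # rows per transaction; tune so each commit applies quickly

conn = psycopg2.connect("dbname=app host=primary.example.internal")

def purge_expired_rows():
    """Delete old rows in bounded batches instead of one giant transaction."""
    while True:
        with conn, conn.cursor() as cur:   # one short transaction per batch
            cur.execute(
                """
                DELETE FROM events
                WHERE id IN (
                    SELECT id FROM events
                    WHERE created_at < now() - interval '90 days'
                    LIMIT %s
                )
                """,
                (BATCH_SIZE,),
            )
            if cur.rowcount < BATCH_SIZE:
                break   # nothing (or only a partial batch) left to delete
```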
📌 Examples
PostgreSQL WAL lag and replication slot bloat: a replication slot preserves WAL segments until the standby consumes them. If the standby is offline for hours, the slot prevents WAL cleanup and the pg_wal directory grows unbounded, potentially filling the disk and crashing the primary. Monitor the pg_replication_slots view for restart_lsn lag; set max_slot_wal_keep_size (PostgreSQL 13 and later) to limit retention, and drop slots for permanently failed standbys.
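A hedged monitoring sketch (psycopg2; the connection string is a placeholder) that reports how much WAL each slot is retaining on the primary, which is the quantity to alert on before pg_wal fills the disk.

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=primary.example.internal")
with conn, conn.cursor() as cur:
    # How far behind the current WAL position each slot's restart_lsn is,
    # i.e. how much WAL the slot is forcing the primary to retain.
    cur.execute(
        """
        SELECT slot_name,
               active,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
        ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC
        """
    )
    for slot_name, active, retained_wal in cur.fetchall():
        print(f"slot={slot_name} active={active} retained_wal={retained_wal}")
```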
MySQL semi-sync fallback to async under load: with rpl_semi_sync_master_timeout set to 1000 ms, if the semi-synchronous replica takes longer than 1 second to acknowledge due to load or network issues, the master logs a warning and falls back to asynchronous mode, losing the zero-RPO guarantee. This can happen silently unless the Rpl_semi_sync_master_status status variable is monitored. Production systems must alert on async fallback to detect durability degradation.
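A hedged polling sketch (mysql-connector-python; connection parameters and the alerting action are placeholders) that detects a silent fallback by checking the Rpl_semi_sync_master_status status variable.

```python
import mysql.connector

conn = mysql.connector.connect(host="primary.example.internal",
                               user="monitor", password="***")

def semi_sync_is_active() -> bool:
    """Return True while the master is still acknowledging semi-synchronously."""
    cur = conn.cursor()
    cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status'")
    row = cur.fetchone()          # e.g. ('Rpl_semi_sync_master_status', 'ON')
    cur.close()
    return row is not None and row[1] == "ON"

if not semi_sync_is_active():
    # Durability has silently degraded to async; page the on-call.
    print("ALERT: primary has fallen back to asynchronous replication")
```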
Kafka broker failure and under-replicated partitions: when a broker fails, the partitions it led become under-replicated. The controller elects new leaders from the In-Sync Replicas (ISR). If min.insync.replicas is set to 2 but only 1 replica remains in the ISR after the failure, producers with acks=all receive NotEnoughReplicas exceptions and writes fail. The cluster cannot self-heal until the failed broker returns or administrators manually reassign partitions to other brokers. This is a deliberate trade-off favoring durability over availability.
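A hedged producer sketch using kafka-python (broker address, topic, and payload are placeholders): with acks='all' and min.insync.replicas=2 configured on the topic, a send fails visibly rather than being accepted with reduced durability when the ISR shrinks.

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers="broker1.example.internal:9092",
    acks="all",        # wait for all in-sync replicas, not just the leader
    retries=5,
)

future = producer.send("orders", b'{"order_id": 123}')
try:
    metadata = future.get(timeout=10)   # raises on NotEnoughReplicas, timeouts, etc.
    print(f"committed to partition {metadata.partition} at offset {metadata.offset}")
except KafkaError as exc:
    # Durability could not be guaranteed; retry later or fail the request upstream.
    print(f"write rejected: {exc}")
```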