
Failure Modes: Topology Changes, Schema Drift, and Connection Storms

Topology Changes During Failover

Read replica architectures introduce failure modes absent in single-instance deployments. The most dangerous: topology changes during failover. When a replica is promoted to primary, write traffic must immediately shift to the new primary while read traffic reconfigures.

If routers and application caches do not detect the promotion quickly, you risk split-brain: some writes go to the old primary (now isolated), others to the new primary, causing data divergence that is extremely difficult to reconcile. Detection latency directly impacts data consistency risk.

Replica Promotion Process

When the primary fails, the system must elect and promote a replica. The promoted replica stops accepting replication, flushes pending writes to disk, and begins accepting write traffic. Other replicas reconfigure to replicate from the new primary. This process typically takes 10-30 seconds for automated failover.
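The promotion sequence above can be sketched in a few lines. This is a minimal illustration with a hypothetical `Replica` interface, not any real database's API:

```python
class Replica:
    """Minimal stand-in for a replica node (hypothetical interface)."""
    def __init__(self):
        self.role = "replica"
        self.replicating = True

    def stop_replication(self):
        self.replicating = False   # stop consuming the old primary's stream

    def flush_pending_writes(self):
        pass                       # make replicated-but-unflushed writes durable

    def enable_writes(self):
        self.role = "primary"      # begin accepting write traffic

def promote(replica):
    """Promotion sequence from the text: stop replication, flush, accept writes."""
    replica.stop_replication()
    replica.flush_pending_writes()
    replica.enable_writes()
    return replica
```

In a real system each step is where failures hide: stopping replication too early loses in-flight changes, and enabling writes before the flush completes risks acknowledging writes that are not durable.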

During promotion, all writes fail. Applications should implement retry logic with exponential backoff—waiting progressively longer between retries (e.g., 100ms, 200ms, 400ms). This prevents overwhelming the recovering system while writes queue during failover.
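The retry pattern above can be sketched as follows. This is a simplified example: `execute_write` is a hypothetical callable standing in for your database write, and the exception type and retry budget would depend on your driver:

```python
import time

def write_with_retry(execute_write, max_retries=5, base_delay=0.1):
    """Retry a failed write with exponential backoff: 100ms, 200ms, 400ms, ...

    `execute_write` is a placeholder for a callable that performs the write
    and raises ConnectionError while the promotion is still in progress.
    """
    for attempt in range(max_retries):
        try:
            return execute_write()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # promotion took longer than our retry budget
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

With five attempts at a 100ms base, the total wait covers roughly 1.5 seconds, so the budget should be sized to the expected 10-30 second failover window or paired with a queue.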

Cascading Replica Failures

When replicas fail, read traffic concentrates on survivors. If one of three replicas fails, the remaining two handle 50% more traffic each. If they are already near capacity, this overload can cascade: the increased load causes slower responses, health checks fail, and the router removes them, sending all traffic to the primary.

Sudden primary overload from absorbed replica traffic can cause the primary to fail, triggering a full outage. Defense requires capacity headroom: replicas should run at 50-60% capacity normally, allowing them to absorb failover load.
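The load arithmetic behind this headroom rule is easy to make explicit. A minimal sketch, assuming reads redistribute evenly across the surviving replicas (a simplification; real routers weight by health and latency):

```python
def surviving_replica_load(current_utilization, total_replicas, failed_replicas):
    """Estimate per-replica utilization after failures, assuming reads
    redistribute evenly across the survivors."""
    survivors = total_replicas - failed_replicas
    if survivors == 0:
        raise ValueError("no replicas left; all reads fall to the primary")
    return current_utilization * total_replicas / survivors
```

For example, three replicas at 60% utilization lose one node: each survivor jumps to 90%, leaving almost no margin before slow responses start failing health checks.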

Network Partition Handling

Network partitions create subtle failure modes. A replica may be reachable by some routers but not others, or reachable but not receiving replication. Applications might send reads to a partition-isolated replica that returns increasingly stale data without any error signals.

Robust systems verify both connectivity and replication health. A replica that cannot reach the primary should mark itself unhealthy and refuse reads, rather than silently serving stale data. This fail-fast behavior surfaces problems quickly instead of hiding them behind eventual consistency semantics.
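The fail-fast check can be sketched as a heartbeat-age test. The threshold and heartbeat mechanism here are assumptions, not any specific database's built-in behavior:

```python
import time

MAX_LAG_SECONDS = 10  # assumed threshold; tune for your staleness tolerance

def replica_is_healthy(last_heartbeat_from_primary, now=None):
    """Fail fast: a replica that has not heard from the primary within the
    lag threshold reports itself unhealthy and should refuse reads, rather
    than silently serving increasingly stale data."""
    now = now if now is not None else time.time()
    return (now - last_heartbeat_from_primary) <= MAX_LAG_SECONDS
```

Wiring this into the router's health check means a partition-isolated replica is removed from rotation automatically, converting silent staleness into a visible capacity drop.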

💡 Key Takeaways
- Replica promotion during failover creates split-brain risk if routers do not detect role changes within seconds. Amazon RDS DNS updates take 30 to 60 seconds; stale connection pools or aggressive DNS caching can extend inconsistency windows to minutes.
- Schema drift during migrations causes query failures or wrong results when queries selecting new columns route to replicas that have not yet applied the DDL, requiring phased rollouts with replication-lag verification between steps.
- Connection storms from naive retry logic can overwhelm remaining replicas when one fails: thousands of clients simultaneously refilling connection pools create a thundering herd that causes cascading overload.
- Forward-compatible migrations add nullable columns or tables that existing queries ignore, then wait for full replication (all replicas at zero lag) before deploying code that reads or writes the new schema elements.
- Production systems implement jittered exponential backoff (spread retries over random intervals), per-client connection rate limits (cap attempts per second), and router admission control (reject connections when queue depth exceeds thresholds).
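Jittered backoff, mentioned in the takeaways, differs from plain exponential backoff by randomizing each delay so clients do not retry in lockstep. A minimal "full jitter" sketch (the `base` and `cap` values are illustrative):

```python
import random

def full_jitter_delay(attempt, base=0.1, cap=30.0):
    """'Full jitter' backoff: pick a uniform-random delay in
    [0, min(cap, base * 2**attempt)] so that thousands of clients
    retrying after the same failure do not reconnect simultaneously."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Plain exponential backoff synchronizes retries: every client that failed at the same instant retries at the same instants, recreating the storm on each wave. Spreading delays uniformly over the window smooths the reconnect load.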
📌 Interview Tips
1. Topology change: Primary fails and a replica is promoted. The old primary's DNS entry stays cached for 60 seconds, so the app writes to the old primary (now isolated) while reading from the new primary. Data diverges, requiring manual reconciliation and acceptance of data loss.
2. Schema drift: A migration adds a 'last_login' column and the DDL replicates in 150ms. An app deployed immediately starts querying last_login; reads hitting lagging replicas return a "column does not exist" error for that 150ms window, causing user-facing 500 errors.
3. Connection storm: A replica holding 5,000 client connections crashes. Every client's connection pool tries to reconnect immediately: 5,000 clients × 10 connections each = 50,000 connection attempts hit the remaining 4 replicas within 1 second, overwhelming them and cascading the failure.