
Failure Modes: Topology Changes, Schema Drift, and Connection Storms

Read replica architectures introduce failure modes that do not exist in single-instance deployments, and understanding these edge cases is critical for production reliability.

Topology changes during failover are the most dangerous. When a replica is promoted to primary, write traffic must shift to the new primary immediately while read traffic is rerouted around the promoted node. If routers and application caches do not detect the promotion quickly, you risk split brain: some writes land on the old primary (now isolated) and some on the new primary, causing data divergence. Managed services like Amazon RDS automate this with endpoint DNS updates that propagate within 30 to 60 seconds, but aggressive DNS caching or stale connection pools can extend the window. Best practice: implement health checks that verify role (primary versus replica) on every connection pool refresh, and cache DNS for no more than 10 to 15 seconds.

Schema drift creates subtle correctness bugs. Suppose you execute a forward-compatible migration that adds a nullable column to a table on the primary. The DDL commits and begins replicating, but for the next 100 milliseconds (or longer under load) replicas do not have the new column. If a query selects that column and routes to a lagging replica, it returns an error or the wrong result set. Worse, if you immediately start writing to the new column, reads from replicas return null until replication catches up, violating read-after-write consistency. Mitigation requires phased rollouts: deploy the DDL, wait for all replicas to apply it (verified via replication lag metrics), then deploy the application code that uses the new schema. Some teams use dual-write patterns, writing to both the old and new schema simultaneously during a transition window.

Connection storms occur when a replica fails suddenly. Naive retry logic in clients can create a thundering herd: thousands of connections simultaneously retry against the remaining replicas, overwhelming them and causing cascading failure. This is especially severe if clients have per-instance connection pools and the pool manager attempts to refill all connections immediately upon detecting the failure. Mitigation strategies include jittered exponential backoff (spread retries over time), per-client connection limits, and router-level admission control that rejects new connections when backend queue depth exceeds safe thresholds. At large scale, Microsoft and Meta use client-side rate limiters that cap connection attempts per second per client instance, preventing any single failure from generating more than a bounded retry load.
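As a rough sketch of the role check described above, assuming PostgreSQL streaming replication and the psycopg2 driver (the pool setup and endpoint name are hypothetical, not from this article):

```python
# Minimal sketch: verify a pooled "writer" connection really points at the primary.
# pg_is_in_recovery() is true on a streaming replica and false on the primary.
from psycopg2.pool import SimpleConnectionPool

def is_replica(conn) -> bool:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery()")
        return cur.fetchone()[0]

def checkout_writer(pool: SimpleConnectionPool):
    """Hand out a connection only if it is attached to the primary.

    After a failover, cached DNS can leave the writer endpoint resolving to the
    demoted node; discarding mismatched connections forces a pool refresh instead
    of silently writing to an isolated old primary.
    """
    conn = pool.getconn()
    if is_replica(conn):
        pool.putconn(conn, close=True)  # drop the stale connection
        raise ConnectionError("writer endpoint is serving a replica; rebuilding pool")
    return conn

# Usage (hypothetical DSN):
# writer_pool = SimpleConnectionPool(1, 10, dsn="postgresql://app@db-primary.internal/app")
# conn = checkout_writer(writer_pool)
```

Discarding rather than reusing a mismatched connection trades a brief reconnect cost for never writing to a demoted primary.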
💡 Key Takeaways
Replica promotion during failover creates split brain risk if routers do not detect role changes within seconds. Amazon RDS DNS updates take 30 to 60 seconds; stale connection pools or aggressive DNS caching can extend inconsistency windows to minutes
Schema drift during migrations causes query failures or wrong results when queries selecting new columns route to replicas that have not yet applied DDL, requiring phased rollouts with replication lag verification between steps
Connection storms from naive retry logic can overwhelm remaining replicas when one fails. Thousands of clients simultaneously refilling connection pools create thundering herds that cause cascading overload
Forward compatible migrations add nullable columns or tables that existing queries ignore, then wait for full replication (all replicas at lag zero) before deploying code that reads or writes new schema elements
Production systems implement jittered exponential backoff (spread retries over random intervals), per client connection rate limits (cap attempts per second), and router admission control (reject connections when queue depth exceeds thresholds)
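To make the backoff part of that last takeaway concrete, here is a minimal sketch assuming a driver-agnostic connect() callable; the attempt count, delays, and exception types are illustrative choices, not prescribed by this article:

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=6, base_delay=0.1, max_delay=5.0,
                           retry_on=(OSError,)):
    """Retry connect() with full jitter so thousands of clients that lose a
    replica at the same instant do not retry in lockstep.

    In practice, narrow retry_on to the database driver's transient errors.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except retry_on:
            # Full jitter: sleep a random amount up to the exponential cap,
            # spreading the retry wave of a thundering herd over time.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
    raise ConnectionError(f"gave up after {max_attempts} reconnect attempts")
```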
📌 Examples
Topology change: Primary fails, replica promoted. Old primary DNS cached for 60 seconds. App writes to the old primary (now isolated) while reading from the new primary. Data diverges, requiring manual reconciliation and acceptance of some data loss.
Schema drift: Migration adds a 'last_login' column. DDL replicates in 150ms. An app deploy immediately starts querying last_login. Reads hitting lagging replicas return a "column does not exist" error for 150ms, causing user-facing 500 errors (a lag-check gate that prevents this is sketched after these examples).
Connection storm: A replica serving 5,000 clients crashes. Each client's connection pool tries to reconnect immediately: 5,000 clients × 10 connections each = 50,000 connection attempts hit the remaining 4 replicas within 1 second, overwhelming them and cascading the failure.
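The schema drift example assumes a "wait for replicas" gate between the DDL deploy and the code deploy. A minimal sketch of one such gate, assuming PostgreSQL streaming replication and psycopg2 (the DSN, lag threshold, timeout, and polling interval are illustrative):

```python
import time
import psycopg2

# WAL bytes each attached replica still has to replay, taken from the primary's view.
LAG_QUERY = """
    SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
    FROM pg_stat_replication
"""

def wait_for_replicas(primary_dsn, max_lag_bytes=0, timeout_s=300.0):
    """Block a deploy pipeline until every replica has replayed the primary's WAL,
    so application code that reads the new column never races the DDL."""
    conn = psycopg2.connect(primary_dsn)
    conn.autocommit = True
    try:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with conn.cursor() as cur:
                cur.execute(LAG_QUERY)
                lag_bytes = cur.fetchone()[0]
            if lag_bytes <= max_lag_bytes:
                return
            time.sleep(1.0)
        raise TimeoutError("replicas did not catch up within the deploy window")
    finally:
        conn.close()

# Usage between the two deploy phases (hypothetical DSN):
# wait_for_replicas("postgresql://deployer@db-primary.internal/app")
```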