Failure Modes: Topology Changes, Schema Drift, and Connection Storms
Topology Changes During Failover
Read replica architectures introduce failure modes absent in single-instance deployments. The most dangerous is a topology change during failover: when a replica is promoted to primary, write traffic must shift immediately to the new primary while read routing reconfigures around the changed topology.
If routers and application caches do not detect the promotion quickly, you risk split-brain: some writes go to the old primary (now isolated), others to the new primary, causing data divergence that is extremely difficult to reconcile. Detection latency directly impacts data consistency risk.
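One common way to bound this risk is epoch-based fencing: every promotion bumps a monotonically increasing topology epoch, and a router refuses writes made under an older view rather than sending them to a node that may no longer be primary. The sketch below is illustrative; the names (`TopologyView`, `route_write`) are hypothetical, not a real API.

```python
class StaleTopologyError(Exception):
    """Raised when a write is attempted with an outdated topology view."""

class TopologyView:
    """Tracks the current primary and a monotonically increasing epoch.

    Each promotion increments the epoch, which acts as a fencing token:
    writers holding an older epoch fail fast instead of writing to the
    old, now-isolated primary.
    """
    def __init__(self, primary, epoch=0):
        self.primary = primary
        self.epoch = epoch

    def promote(self, new_primary):
        self.primary = new_primary
        self.epoch += 1  # invalidates every older topology view

    def route_write(self, observed_epoch):
        if observed_epoch < self.epoch:
            raise StaleTopologyError("refusing write with stale topology view")
        return self.primary
```

A client that cached the epoch before a promotion gets an error instead of a silent write to the wrong node, which is exactly the failure split-brain reconciliation would otherwise have to untangle.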
Replica Promotion Process
When the primary fails, the system must elect and promote a replica. The promoted replica stops replicating from the failed primary, flushes pending writes to disk, and begins accepting write traffic. The other replicas reconfigure to replicate from the new primary. This process typically takes 10-30 seconds for automated failover.
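The reconfiguration step can be sketched as a function over a cluster map: promote the chosen candidate, then repoint every surviving replica at it. This is a simplified model; real coordinators (Patroni, Redis Sentinel, and similar) also fence the old primary and verify replication catch-up before completing.

```python
def failover(cluster, candidate):
    """Promote `candidate` and repoint surviving replicas at it.

    `cluster` maps node name -> {"role": ..., "upstream": ...}.
    Illustrative only: a real coordinator would first fence the old
    primary and wait for the candidate to flush pending writes.
    """
    cluster[candidate]["role"] = "primary"
    cluster[candidate]["upstream"] = None  # primaries replicate from no one
    for name, node in cluster.items():
        if name != candidate and node["role"] == "replica":
            node["upstream"] = candidate  # replicate from the new primary
    return cluster
```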
During promotion, all writes fail. Applications should implement retry logic with exponential backoff—waiting progressively longer between retries (e.g., 100ms, 200ms, 400ms). This prevents overwhelming the recovering system while writes queue during failover.
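The backoff schedule above can be implemented as a small retry wrapper. The sketch assumes failed writes raise `ConnectionError`; a small random jitter is added so that many clients retrying at once do not hit the newly promoted primary in lockstep.

```python
import random
import time

def retry_with_backoff(op, base_delay=0.1, max_attempts=5, sleep=time.sleep):
    """Retry `op` with exponential backoff: 100ms, 200ms, 400ms, ...

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # jittered wait
```

With the defaults, a write that fails through a 10-30 second promotion window is retried a handful of times with growing gaps instead of hammering the recovering primary.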
Cascading Replica Failures
When replicas fail, read traffic concentrates on survivors. If one of three replicas fails, the remaining two handle 50% more traffic each. If they are already near capacity, this overload can cascade: the increased load causes slower responses, health checks fail, and the router removes them, sending all traffic to the primary.
If the router then falls back to the primary, the absorbed read traffic can overload it as well, turning a single replica failure into a full outage. The defense is capacity headroom: run replicas at 50-60% utilization in steady state so the survivors can absorb failover load.
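The headroom arithmetic is worth making explicit. Assuming reads rebalance evenly (an idealized model), each survivor's utilization scales by N/(N-k) when k of N replicas fail:

```python
def survivor_utilization(n_replicas, utilization, failed):
    """Per-survivor utilization after `failed` replicas drop out,
    assuming read traffic rebalances evenly across survivors."""
    survivors = n_replicas - failed
    if survivors <= 0:
        raise ValueError("no survivors to absorb the load")
    return utilization * n_replicas / survivors
```

Three replicas at 60% each land at 90% when one fails: stressed but survivable. Start at 70% and the survivors are pushed past 100%, which is where the health-check death spiral begins.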
Network Partition Handling
Network partitions create subtle failure modes. A replica may be reachable by some routers but not others, or reachable but not receiving replication. Applications might send reads to a partition-isolated replica that returns increasingly stale data without any error signals.
Robust systems verify both connectivity and replication health. A replica that cannot reach the primary should mark itself unhealthy and refuse reads, rather than silently serving stale data. This fail-fast behavior surfaces problems quickly instead of hiding them behind eventual consistency semantics.
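The fail-fast rule can be sketched as a staleness gate on the read path. The threshold and class names here are illustrative; the clock is passed in explicitly (a monotonic timestamp from the caller) so the behavior is deterministic.

```python
class ReplicaStaleError(Exception):
    """Raised instead of serving a read from a possibly stale replica."""

class Replica:
    """Refuses reads when it cannot confirm recent contact with the primary."""

    STALENESS_LIMIT_S = 10.0  # illustrative threshold, tune per workload

    def __init__(self):
        self.last_primary_contact = None  # monotonic timestamp, or None

    def heartbeat(self, now):
        """Record successful replication contact with the primary."""
        self.last_primary_contact = now

    def read(self, key, store, now):
        """Serve a read only if the replication link is provably fresh."""
        if (self.last_primary_contact is None
                or now - self.last_primary_contact > self.STALENESS_LIMIT_S):
            raise ReplicaStaleError("replication link lost; refusing read")
        return store.get(key)
```

A partition-isolated replica stops answering within one staleness window, so routers see hard errors and fail over, rather than applications silently consuming ever-older data.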