
CAP Failure Modes: When Distributed Systems Break

The Edge Cases That Matter: Understanding CAP is not enough; you must anticipate how systems fail under CAP constraints. Network partitions are not clean: slow links masquerade as failures, hot partitions amplify quorum costs, and cascading failures turn local problems into global outages. These failure modes determine whether your system survives production traffic or collapses under load.

Split Brain and Dual Primaries: In leader-based CP systems, a GC pause or network hiccup can make a leader unreachable. Followers elect a new leader, but the old leader wakes up still believing it is primary. Now you have two leaders accepting writes, violating linearizability. Correct implementation requires fencing tokens: the new leader gets epoch N+1, and storage systems must refuse any operation stamped with the stale epoch N (sketched below). Without this, you get data corruption. For example, two leaders both increment a counter from 100: each writes 101, and the final value is 101 instead of 102.

Leader thrashing: If the network is flaky, elections happen repeatedly. Each election costs coordination rounds (100 to 500ms is typical), during which writes queue up and tail latency spikes to seconds. Mitigation: adaptive timeouts that grow with observed RTT variance, and witness nodes in separate failure domains.

Cache Stampede and Thundering Herd: AP systems often use caching to hide eventual-consistency lag. A hot key expires in the cache, and suddenly 10,000 concurrent requests hit the database simultaneously. Database query time spikes from 10ms to 30 seconds, cascading into connection pool exhaustion and service unavailability. The stale-while-revalidate pattern (sketched below) serves the stale cache entry immediately and triggers a background refresh: only one request does the refresh, while the others get stale data with 1ms latency instead of waiting. An alternative is probabilistic early expiration, where the cache refreshes slightly before the TTL expires, based on load.

Thundering herd on startup: 1,000 servers restart after a deploy and all simultaneously query the configuration service. The configuration service overloads, new servers cannot start, automated replacement is triggered, and more servers restart. Mitigation: staggered rollouts, startup jitter, and static fallback configs.
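The split-brain fix above boils down to a comparison the storage layer makes on every write. Here is a minimal sketch in Python; the names (FencedStore, StaleEpochError) and the in-memory dict are illustrative only, and a real system would persist the highest epoch durably.

```python
import threading


class StaleEpochError(Exception):
    """Raised when a deposed leader tries to write with an old fencing token."""


class FencedStore:
    """Illustrative storage layer that rejects writes carrying a stale epoch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._highest_epoch = 0   # highest fencing token accepted so far
        self._data = {}

    def write(self, key, value, epoch):
        with self._lock:
            if epoch < self._highest_epoch:
                # The old leader (epoch N) woke up after a new leader (epoch N+1)
                # was elected; refuse the write instead of corrupting data.
                raise StaleEpochError(
                    f"write at epoch {epoch} rejected; current epoch is {self._highest_epoch}"
                )
            self._highest_epoch = epoch
            self._data[key] = value


# Example: the new leader writes at epoch 8, then the old leader retries at epoch 7.
store = FencedStore()
store.write("counter", 101, epoch=8)      # accepted
try:
    store.write("counter", 101, epoch=7)  # deposed leader: rejected
except StaleEpochError as e:
    print(e)
```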
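The stale-while-revalidate pattern can be sketched as a small wrapper around whatever loader performs the expensive database query. Everything here (SwrCache, the loader callable, the TTL handling) is an illustrative assumption rather than a specific library's API.

```python
import threading
import time


class SwrCache:
    """Illustrative stale-while-revalidate cache (not production code)."""

    def __init__(self, loader, ttl=60.0):
        self._loader = loader        # e.g. key -> database query result
        self._ttl = ttl
        self._lock = threading.Lock()
        self._entries = {}           # key -> (value, fetched_at)
        self._refreshing = set()     # keys with a refresh already in flight

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            expired = entry is None or (time.time() - entry[1]) > self._ttl
            i_refresh = expired and key not in self._refreshing
            if i_refresh:
                self._refreshing.add(key)

        if entry is None:
            return self._refresh(key)    # cold miss: nothing stale to serve
        if i_refresh:
            # Exactly one caller refreshes in the background; everyone else
            # keeps getting the stale value at in-memory latency.
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return entry[0]

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.time())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)
```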
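For the startup thundering herd, the mitigation is mostly about not arriving all at once. A tiny sketch, assuming hypothetical fetch_config and fallback_config inputs:

```python
import random
import time


def load_config_with_jitter(fetch_config, fallback_config, max_jitter_s=30.0):
    """Illustrative startup path: jitter the config fetch, fall back to a static copy."""
    # Spread 1,000 simultaneous restarts across a window instead of one spike.
    time.sleep(random.uniform(0.0, max_jitter_s))
    try:
        return fetch_config()
    except Exception:
        # Configuration service overloaded or unreachable: start with the
        # baked-in fallback instead of failing and triggering more restarts.
        return fallback_config
```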
❗ Remember: Timeouts create false partitions. A database query taking 5 seconds due to lock contention looks identical to a network partition to the client. If you treat it as a partition and fail over, you may start dual writes or serve stale data unnecessarily.
Read Replica Lag and Anomalies: Asynchronous replication in AP systems means replicas lag behind the primary. Under heavy write load or network congestion, lag grows from milliseconds to seconds. A user writes to the primary, immediately reads from a replica, and sees old data. This violates read-your-writes, causing "my post disappeared" bugs. Measure replication lag continuously: DynamoDB Global Tables report cross-region lag; monitor it and surface it to the application. If lag exceeds a threshold (say, 5 seconds), either wait for the replica to catch up, read from the primary (adding latency), or show the user a "data syncing" indicator. Do not pretend the lag does not exist.

Worst case: a replica lags 10 seconds behind during an incident. The user submits a form, sees a confirmation, refreshes the page (hitting the lagged replica), sees the form blank again, resubmits, and creates a duplicate. Defenses: client-side deduplication tokens, session-sticky routing, or pessimistically assuming writes are slow and showing a pending state (a routing sketch follows below).

Hot Partitions and Tail Latency Amplification: Quorums amplify the impact of slow nodes. With R=2, your read latency is the maximum of 2 replicas; if one replica is slow (GC pause, disk stall, hot CPU), your read waits. At scale, the slowest of N nodes dominates tail latency. Hot partition example: a celebrity user tweets and 80% of read traffic hits their shard. If that shard requires quorum reads and one replica is loaded, p99 latency jumps from 10ms to 500ms for everyone reading that shard, while other shards remain fast. Mitigations: adaptive replica selection (skip replicas with p99 above a threshold), hedged requests (send a backup request after 20ms and take the first response; sketched below), and caching hot keys at the edge. Spotify caches top-artist data in a CDN to avoid hammering database shards.

Cascading Failures Across Consistency Boundaries: CP services are often dependencies of AP services. If your CP coordination layer (ZooKeeper, Consul) slows down due to quorum issues, every dependent service that checks locks or config on each request also slows down. Connection pools fill, timeouts trigger, and the entire system becomes unavailable despite most nodes being healthy. Defense in depth: circuit breakers on calls to CP services (fail fast after N timeouts), cached fallback values (use stale config rather than block), and careful evaluation of whether each operation truly needs CP guarantees. Many "coordination" use cases can tolerate eventual consistency if you design for it.
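A sketch of the replica-lag defenses described above, combining session-sticky reads after a write with a lag threshold. The session dict, field names, and thresholds are illustrative, not from any particular framework.

```python
import time

MAX_REPLICA_LAG_S = 5.0            # threshold from the text: ~5 seconds
READ_YOUR_WRITES_WINDOW_S = 10.0   # pin reads to the primary after a write


def record_write(session, now=None):
    """Call after every write so this session's next reads stay on the primary."""
    session["last_write_at"] = time.time() if now is None else now


def choose_read_target(session, replica_lag_s, now=None):
    """Return 'primary' or 'replica' for this session's next read."""
    now = time.time() if now is None else now
    if now - session.get("last_write_at", 0.0) < READ_YOUR_WRITES_WINDOW_S:
        return "primary"   # user just wrote; a lagged replica would "lose" their data
    if replica_lag_s > MAX_REPLICA_LAG_S:
        return "primary"   # replica too far behind; accept the extra primary latency
    return "replica"


# Example: immediately after a write, reads go to the primary even if lag is low.
session = {}
record_write(session)
assert choose_read_target(session, replica_lag_s=0.2) == "primary"
```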
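Hedged requests can be sketched with a thread pool: send the read to one replica, and if it has not answered within roughly 20ms, send a backup request to a second replica and return whichever finishes first. call_replica and replicas are placeholders for however your client actually reads a shard.

```python
import concurrent.futures


def hedged_read(call_replica, replicas, hedge_delay_s=0.020):
    """Illustrative hedged request: fire a backup read if the first one is slow."""
    # Assumes at least two replicas are available for this shard.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(call_replica, replicas[0])
        done, _ = concurrent.futures.wait({first}, timeout=hedge_delay_s)
        if done:
            return first.result()             # fast path: no hedge needed
        # First replica is slow (GC pause, disk stall): hedge to a second one.
        backup = pool.submit(call_replica, replicas[1])
        done, _ = concurrent.futures.wait(
            {first, backup}, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return done.pop().result()            # take whichever finished first
    finally:
        pool.shutdown(wait=False)             # don't block on the straggler
```

Hedging trades a small amount of extra load for better tail latency, so production systems typically cap the fraction of requests that are hedged.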
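Finally, the cascading-failure defenses (circuit breaker plus cached fallback) fit in a small wrapper around the CP client. fetch_fn stands in for a real coordination-service read with its own timeout; the class name and thresholds are arbitrary examples.

```python
import time


class BreakerWithFallback:
    """Illustrative circuit breaker that serves stale cached values when open."""

    def __init__(self, fetch_fn, failure_threshold=3, open_for_s=30.0):
        self._fetch = fetch_fn              # call to the CP service (must have a timeout)
        self._failure_threshold = failure_threshold
        self._open_for_s = open_for_s
        self._failures = 0
        self._opened_at = None
        self._cache = {}                    # last known-good value per key (may be stale)

    def get(self, key):
        if self._opened_at is not None:
            if time.time() - self._opened_at < self._open_for_s:
                return self._fallback(key)  # open: fail fast, don't tie up a connection
            self._opened_at = None          # half-open: let one probe through

        try:
            value = self._fetch(key)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = time.time()   # too many timeouts: open the breaker
            return self._fallback(key)

        self._failures = 0
        self._cache[key] = value
        return value

    def _fallback(self, key):
        if key in self._cache:
            return self._cache[key]         # stale config beats blocking the request path
        raise RuntimeError("coordination service unavailable and no cached value")
```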
💡 Key Takeaways
Split brain in leader-based systems causes dual primaries accepting conflicting writes unless fencing tokens enforce epoch ordering; storage must reject stale-epoch writes to prevent data corruption
Cache stampede occurs when a hot key expires and 10,000+ concurrent requests hit the database simultaneously, spiking query time from 10ms to 30 seconds and exhausting connection pools; stale-while-revalidate prevents this by serving cached data while one request refreshes
Asynchronous replication creates read replica lag that grows under load, violating read-your-writes when a user writes to the primary then reads from a lagged replica; defenses include session-sticky routing and client deduplication tokens
Hot partitions with quorum reads amplify tail latency: if 80% of traffic hits one shard and one replica is slow, p99 jumps from 10ms to 500ms for that shard while others remain fast, requiring adaptive replica selection or hedged requests
Cascading failures occur when a slow CP coordination service (ZooKeeper) blocks dependent services on every request, filling connection pools and making the entire system unavailable despite most nodes being healthy; circuit breakers and cached fallback values prevent this
📌 Examples
ZooKeeper during network partition: 2 of 5 nodes partitioned, remaining 3 elect new leader in 200ms. Old leader wakes after GC pause with stale epoch. Storage layer rejects old leader writes based on epoch N vs N+1 fencing tokens
Reddit cache stampede: popular post expires from cache, 50,000 requests hit PostgreSQL simultaneously. Query time goes from 15ms to 60 seconds, connection pool (100 max) exhausted in 2 seconds, site becomes unavailable for 5 minutes
Instagram read replica lag: user posts photo, write goes to primary in 5ms. User refreshes feed, reads from replica lagged 3 seconds behind, photo missing. User reports bug, confusion ensues. Fix: session token ensures reads hit primary for 10 seconds post write
Twitter hot partition: celebrity with 100M followers tweets. 80% of read QPS hits their shard. One replica experiences GC pause (500ms). All quorum reads to that shard wait 500ms, p99 latency for entire site spikes. CDN cache added for celebrity accounts