
CAP Failure Modes and Edge Cases

Split Brain: When CP Goes Wrong

Split brain happens when a partition causes both sides to think they hold the majority. Suppose a 3-node cluster is partitioned so that each side can reach one node but not the others. If both sides accept writes, you get divergent histories. Proper CP implementations prevent this with term numbers or epoch counters: each leader election increments the term, and nodes reject requests from leaders carrying a stale term (sketched below). If a buggy implementation skips this check, you get conflicting data that requires manual reconciliation. MongoDB versions before 3.4 had a split-brain vulnerability under specific network conditions; it was fixed by strengthening the majority calculation.

Slow Networks Masquerading as Partitions

CAP assumes a binary network state: working or partitioned. Reality is messier. Suppose inter-node RTT spikes from 1ms to 500ms due to congestion, or a 3-second garbage collection pause blocks a process. With client timeouts set to 200ms and quorum timeouts at 300ms, this looks like a partition to consensus algorithms. CP systems start failing requests and electing new leaders; AP systems keep working, but replication lag grows. The problem: these "soft partitions" are more common than hard failures. Production systems see RTT spikes above 100ms roughly once per day in large clusters. If your CP system interprets every spike as a partition, availability drops significantly. Solution: adaptive timeouts and gray failure detection. Instead of a fixed 300ms timeout, use 3x the recent p99 RTT (sketched below). Track partial connectivity (can reach 2 of 3 nodes) differently from total isolation.

Eventual Consistency Conflicts

AP systems using last write wins have a subtle data loss bug. Two clients in partitioned zones update the same shopping cart. Zone A adds an item at timestamp 1000. Zone B adds a different item at real time 1001, but its clock is 200ms behind, so it records timestamp 801. When the zones reconcile, the system compares 1000 against 801, keeps Zone A's write as the "latest," and discards the 801 write. The user loses their item. At scale, with millions of carts and clock drift up to 200ms, you lose roughly 0.01% of concurrent updates. The better approach is version vectors: each write carries a vector clock showing which replicas have seen which versions, so conflicts are detected and resolved with application logic (merge the cart items) rather than silently dropping data (a reconciliation sketch follows below).
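A minimal sketch of the stale-term rejection described above, loosely modeled on Raft's AppendEntries rule. The Follower class and its fields are illustrative, not taken from any particular database:

```python
from dataclasses import dataclass, field

@dataclass
class Follower:
    """Toy follower that enforces the term check (illustrative names only)."""
    current_term: int = 0
    log: list = field(default_factory=list)

    def append_entries(self, leader_term: int, entry: str) -> bool:
        # Reject writes from a leader whose term is stale. A deposed leader
        # that wakes up after a partition (or a long GC pause) still carries
        # its old term and is refused here, preventing divergent histories.
        if leader_term < self.current_term:
            return False
        # A higher term means a newer election happened; adopt it.
        self.current_term = leader_term
        self.log.append((leader_term, entry))
        return True

follower = Follower(current_term=5)
print(follower.append_entries(leader_term=5, entry="x=1"))  # True: current leader
print(follower.append_entries(leader_term=3, entry="x=2"))  # False: stale leader rejected
```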
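A sketch of the adaptive timeout idea, using only the 3x-p99 rule and 300ms fallback from the text; the class name, window size, and floor value are invented for illustration:

```python
from collections import deque

class AdaptiveTimeout:
    """Illustrative failure-detector timeout: 3x the p99 of recent RTT samples
    instead of a fixed 300 ms."""

    def __init__(self, window: int = 1000, multiplier: float = 3.0, floor_ms: float = 50.0):
        self.samples = deque(maxlen=window)   # recent RTTs in milliseconds
        self.multiplier = multiplier
        self.floor_ms = floor_ms              # never shrink below a sane minimum

    def record_rtt(self, rtt_ms: float) -> None:
        self.samples.append(rtt_ms)

    def timeout_ms(self) -> float:
        if not self.samples:
            return 300.0                      # fall back to the fixed default
        ordered = sorted(self.samples)
        p99 = ordered[int(0.99 * (len(ordered) - 1))]
        return max(self.floor_ms, self.multiplier * p99)

detector = AdaptiveTimeout()
for rtt in [1.0] * 980 + [150.0] * 20:        # mostly 1 ms, with a congestion burst
    detector.record_rtt(rtt)
print(detector.timeout_ms())                  # 450.0: stretches past the spike instead of failing over
```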
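And a sketch of version-vector reconciliation for the cart example, assuming a simple replica-id-to-counter map and set-union merge logic:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two version vectors (replica id -> counter).
    Returns which one dominates, or 'concurrent' if neither does."""
    keys = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(k, 0) >= vc_b.get(k, 0) for k in keys)
    b_ge = all(vc_b.get(k, 0) >= vc_a.get(k, 0) for k in keys)
    if a_ge:
        return "a_dominates"
    if b_ge:
        return "b_dominates"
    return "concurrent"

def reconcile(cart_a: set, vc_a: dict, cart_b: set, vc_b: dict):
    """If one version dominates, keep it; if the writes are concurrent, merge
    with application logic (union of cart items) instead of silently dropping
    one side, as last write wins would."""
    order = compare(vc_a, vc_b)
    if order == "a_dominates":
        return cart_a, vc_a
    if order == "b_dominates":
        return cart_b, vc_b
    merged_vc = {k: max(vc_a.get(k, 0), vc_b.get(k, 0)) for k in set(vc_a) | set(vc_b)}
    return cart_a | cart_b, merged_vc

# Zone A and Zone B each added an item to the same cart while partitioned.
print(reconcile({"book"}, {"A": 1}, {"headphones"}, {"B": 1}))
# ({'book', 'headphones'}, {'A': 1, 'B': 1}) -- both writes survive (set order may vary)
```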
❗ Remember: CAP only applies during partitions. Many "CA" systems people cite, like single node databases, simply fail entirely during network issues. True distributed systems are always CP or AP.
The PACELC Extension

PACELC extends CAP: if there is a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This captures the fact that even without partitions, strong consistency costs latency. Google Spanner (PC/EC) chooses consistency both during partitions and during normal operation, paying a 5ms to 10ms latency cost. Cassandra (PA/EL) chooses availability during partitions and low latency during normal operation, accepting eventual consistency.
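As a toy illustration of the "Else" half of that tradeoff (not any real database's API; the replica latencies and values are invented), a per-read consistency knob might look like this:

```python
import time

# Toy replica set: (one-way latency in seconds, stored value). The nearby
# replica lags; the remote ones already have the latest write.
REPLICAS = [(0.002, "v1-stale"), (0.040, "v2-latest"), (0.045, "v2-latest")]

def read(prefer_consistency: bool) -> str:
    """A quorum read waits on a majority and returns the newest value it sees
    (string max stands in for a real version comparison); a local read returns
    the nearest replica immediately, trading freshness for latency."""
    if prefer_consistency:
        majority = sorted(REPLICAS)[:2]            # any 2 of 3 replicas form a majority
        time.sleep(max(lat for lat, _ in majority))
        return max(val for _, val in majority)
    lat, val = min(REPLICAS)                       # nearest replica only
    time.sleep(lat)
    return val

print(read(prefer_consistency=True))   # 'v2-latest' after ~40 ms
print(read(prefer_consistency=False))  # 'v1-stale'  after ~2 ms
```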
💡 Key Takeaways
Split brain occurs when both sides of a partition think they hold the majority. Proper CP systems use term numbers to detect stale leaders and reject their writes.
Slow networks (RTT spikes to 500ms) look like partitions to systems with 200ms to 300ms timeouts, causing false failovers that reduce availability.
AP systems using last write wins lose data when clock skew exceeds 200ms during concurrent updates. Version vectors detect conflicts but require merge logic.
PACELC extends CAP: during partitions choose A or C, during normal operation choose latency or consistency. Spanner is PC/EC, Cassandra is PA/EL.
📌 Examples
1. Clock skew bug: Two data centers update the same user profile. DC1 writes at real time 1000ms (clock shows 1000). DC2 writes at real time 1100ms (clock shows 900 due to 200ms skew). Last write wins picks the 1000ms timestamp and loses the actually newer 1100ms write (see the sketch after these examples).
2. GC pause partition: A Java process pauses for 3 seconds during a full GC. Other nodes time out after 500ms, elect a new leader, and start accepting writes. The original leader wakes up, doesn't realize it's stale, and creates conflicting data unless term checks prevent it.
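A few lines, under invented field names, showing how the clock-skew example plays out under last write wins:

```python
# Example 1 in code: DC2's clock runs 200 ms behind, so its physically newer
# write carries the smaller timestamp. Values and layout are illustrative.
write_dc1 = (1000, "profile-from-DC1")  # real time 1000 ms, local clock reads 1000
write_dc2 = (900,  "profile-from-DC2")  # real time 1100 ms, local clock reads 900

def last_write_wins(a, b):
    # Keep whichever write carries the larger recorded timestamp.
    return a if a[0] >= b[0] else b

print(last_write_wins(write_dc1, write_dc2))
# (1000, 'profile-from-DC1') -- the physically newer DC2 write is silently discarded
```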