Failure Modes and Edge Cases in Distributed Consistency
Network partitions are not always clean splits. Asymmetric partitions occur when node A can reach node B but B cannot reach A, often due to firewall rules, routing loops, or one-way packet loss. This breaks failure detection: B may decide A is dead and trigger failover, while A still believes it is the leader and keeps serving writes. Without fencing tokens or generation numbers, you end up with dual writers corrupting state. Google Spanner uses epoch-based leader leases and majority quorums to prevent this, but transient asymmetry still causes unnecessary failovers and latency spikes.
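A minimal sketch of the fencing-token idea: the lease service hands out a monotonically increasing epoch with each leadership grant, and the storage layer rejects writes carrying an older epoch, so a deposed leader cannot overwrite the new leader's state. The class and method names here are illustrative, not from any particular system.

```python
# Sketch: a storage node rejects writes whose fencing token (epoch) is older
# than the newest one it has already seen, so a partitioned-away leader that
# still believes it holds the lease cannot corrupt state.

class FencedStore:
    def __init__(self):
        self.highest_token_seen = 0   # newest epoch observed so far
        self.data = {}

    def write(self, key, value, fencing_token):
        # Reject anything from an older epoch: a new leader was elected with a
        # higher token, and its writes must not be clobbered by the stale one.
        if fencing_token < self.highest_token_seen:
            raise PermissionError(
                f"stale fencing token {fencing_token} < {self.highest_token_seen}"
            )
        self.highest_token_seen = fencing_token
        self.data[key] = value


store = FencedStore()
store.write("balance", 100, fencing_token=7)       # old leader, epoch 7
store.write("balance", 250, fencing_token=8)       # new leader, epoch 8
try:
    store.write("balance", 300, fencing_token=7)   # old leader retries
except PermissionError as e:
    print("rejected:", e)
```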
In an asynchronous network, a slow node is indistinguishable from a failed one; this is the intuition behind the FLP impossibility result: no detector can reliably tell whether a node has crashed or is merely stuck in a garbage-collection pause. CP systems with aggressive timeouts (say, 100ms) may trigger leader elections during a minor GC pause, causing cascading failures as new leaders also pause under load. Adaptive failure detectors (like the phi accrual detector in Cassandra) help, but cannot eliminate false positives. The result is occasional split-brain attempts (mitigated by quorums) and tail-latency spikes as clients retry during leadership transitions.
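A sketch of the phi-accrual idea (simplified here to an exponential inter-arrival model; Cassandra's actual implementation differs in detail): instead of a binary alive/dead timeout, the detector outputs a suspicion level that grows the longer a heartbeat is overdue relative to that node's own recent heartbeat history, so a briefly slow node is not immediately declared dead. Names and the threshold value are illustrative.

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now if now is not None else time.monotonic()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now if now is not None else time.monotonic()
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # phi = -log10(P(heartbeat still pending)) under an exponential model
        return elapsed / (mean * math.log(10))

    def suspect(self, threshold=8.0, now=None):
        return self.phi(now) > threshold


d = PhiAccrualDetector()
for t in range(10):               # heartbeats arriving every ~1s
    d.heartbeat(now=float(t))
print(d.phi(now=10.5))            # slightly late: low suspicion (~0.65)
print(d.suspect(now=30.0))        # very late: phi crosses the threshold
```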
Conflict resolution in AP systems is notoriously fragile. Last-writer-wins (LWW) with wall-clock timestamps fails when clocks skew: a write at 10:00:01 on a fast clock can overwrite a write at 10:00:02 on a slow clock, silently losing the later update. DynamoDB Global Tables experienced issues where clock skew between regions caused data loss until timestamp precision and conflict detection were improved. CRDTs solve this for counters and sets (their merges are commutative), but application-level data such as shopping carts still needs custom merge logic (union the items, deduplicate). Without careful design, users see items vanish or duplicate.
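The contrast is easiest to see side by side. This sketch compares whole-value LWW with a custom union merge for a cart replicated in two regions whose clocks are skewed; the data shapes and function names are illustrative, and removals (which would need tombstones) are omitted.

```python
def lww_merge(a, b):
    # Whole-value LWW: the replica with the newer timestamp wins outright,
    # silently dropping any item added concurrently on the other replica.
    return a if a["ts"] >= b["ts"] else b

def cart_merge(a, b):
    # Application-level merge: union the items and deduplicate.
    return {"ts": max(a["ts"], b["ts"]),
            "items": sorted(set(a["items"]) | set(b["items"]))}

region_a = {"ts": 1000, "items": ["book", "X"]}  # fast clock
region_b = {"ts": 999,  "items": ["book", "Y"]}  # slow clock, later in real time

print(lww_merge(region_a, region_b))   # {'ts': 1000, 'items': ['book', 'X']} -> Y lost
print(cart_merge(region_a, region_b))  # {'ts': 1000, 'items': ['X', 'Y', 'book']}
```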
💡 Key Takeaways
•Asymmetric partitions (A reaches B, B cannot reach A) cause dual-leader scenarios; prevent them with fencing tokens, generation numbers, and majority quorums, but expect transient unavailability during false failovers
•Slow nodes and failed nodes are indistinguishable (the FLP intuition); aggressive timeouts (under 100ms) trigger false leader elections during garbage-collection pauses, cascading into cluster-wide latency spikes
•Last-writer-wins with wall-clock timestamps loses data when clocks skew; DynamoDB Global Tables rely on reasonably synchronized clocks (NTP within hundreds of milliseconds), but multi-region skew still causes occasional conflicts
•Stale reads in eventually consistent systems break read-your-writes guarantees: a user writes data, then immediately reads from a different replica and sees old state; fix with session stickiness or client version tokens (see the sketch after this list)
•Hot shards amplify quorum costs: if one replica is slow due to load, QUORUM reads wait for it, degrading p99 tail latency across all clients; use replica selection (dynamic snitching) to route around known slow nodes
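One way to get read-your-writes without pinning every request to the leader is a client-held version token, as referenced above: the client remembers the highest version it has written, and a replica only answers the read if it has replicated at least that far. This is a simplified sketch under that assumption; all names are illustrative.

```python
class Replica:
    def __init__(self, applied_version=0):
        self.applied_version = applied_version   # how far replication has caught up
        self.data = {}

    def read(self, key, min_version):
        if self.applied_version < min_version:
            return None                          # too stale for this client
        return self.data.get(key)

class Client:
    def __init__(self, leader, follower):
        self.leader = leader
        self.follower = follower
        self.last_written_version = 0            # version token carried by the client

    def write(self, key, value):
        self.leader.data[key] = value
        self.leader.applied_version += 1
        self.last_written_version = self.leader.applied_version

    def read(self, key):
        value = self.follower.read(key, self.last_written_version)
        if value is None:                        # follower lagging: fall back to leader
            value = self.leader.read(key, self.last_written_version)
        return value


leader, follower = Replica(), Replica()
c = Client(leader, follower)
c.write("profile", "new-name")
print(c.read("profile"))   # 'new-name', even though the follower is still stale
```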
📌 Examples
ZooKeeper ensemble with 5 nodes: with aggressive session timeouts, a 200ms GC pause on the leader triggers re-election; if the new leader also pauses, the ensemble thrashes between leaders, causing 10+ seconds of unavailability despite a majority being healthy
Cassandra with hot partition: one node serves 10x traffic for a celebrity user; QUORUM reads hit that node and wait, causing p99 latency to spike from 10ms to 200ms even though other replicas are idle
DynamoDB Global Tables: two regions concurrently update a user profile (name change); if clocks differ by 1 second and updates arrive out of timestamp order, the older write overwrites the newer one, losing the intended update
Shopping cart with LWW: user in Region A adds item X at t=1000, user in Region B (same session, stolen cookie) adds item Y at t=999 (due to clock skew); after replication, item Y disappears because its timestamp is older