
Failure Modes: Clock Skew, Hot Keys, and Split Brain

Multi-leader replication surfaces subtle failure modes that single-leader systems avoid through serialization. Clock skew is the most insidious when conflict resolution uses last-writer-wins (LWW) based on physical timestamps. If the Tokyo leader's system clock is 2 seconds ahead of London's, all Tokyo writes win conflicts even when London's updates are logically newer from the user's perspective. Under skew, a user updating their profile in London at time T can have that update permanently overwritten by a stale Tokyo write from T minus 1 second that carries a skewed timestamp of T plus 1 second. This violates user intent and creates data loss that appears nondeterministic. Mitigation requires hybrid logical clocks (HLC) that bound physical clock drift with logical sequence numbers, or avoiding timestamp-based LWW entirely for critical fields in favor of vector clocks or conditional writes with explicit versioning.

Hot-key thrashing occurs when a popular key (a celebrity's follower count, a trending hashtag) receives concurrent writes in multiple Regions. Each Region's update creates a conflict, forcing replication of conflict-resolution metadata and potentially causing sustained replication lag as the hot key bounces between states faster than cross-Region propagation can settle them. At a 100ms cross-Region round-trip time (RTT), a key updated every 10ms in two Regions generates 10 conflicts per RTT window. Systems mitigate this by routing ownership of hot keys to a designated home leader and having remote Regions forward writes there (sacrificing local write latency for consistency), by sharding within the logical key (splitting a counter across subkeys that merge on read), or by rate-limiting updates per key. Amazon DynamoDB adaptive capacity automatically allocates more throughput to hot partitions but cannot fully solve write conflicts across Regions.

Split brain during network partitions creates independent write histories that must be reconciled on healing. If Tokyo and London are partitioned for 5 minutes and both accept writes to the same user's shopping cart, they build divergent version histories. On reconnection, the system must merge these histories without losing items or creating duplicates. Worse, if writes trigger side effects (sending email confirmations, decrementing inventory), replay during reconciliation can duplicate those effects. Production systems require idempotency with unique operation IDs, explicit deduplication windows, and side effects designed to be retriable. Global uniqueness constraints (like ensuring a username is unique across all Regions) are fundamentally hard without coordination; common patterns include allocating disjoint ID spaces per leader (Tokyo gets IDs 1000000 to 1999999, London 2000000 to 2999999) or routing uniqueness checks to a single authority Region.
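One way to avoid the skewed-timestamp overwrite is to stamp writes with a hybrid logical clock instead of raw wall-clock time. The sketch below is a minimal HLC, assuming each leader runs one instance and attaches the resulting timestamp to every write it accepts; the names (`HybridLogicalClock`, `observe`) are illustrative, not any particular library's API. Writes that are truly concurrent still need a deterministic tiebreak, such as comparing origin Region IDs.

```python
import time
from dataclasses import dataclass


def physical_now_ms() -> int:
    # Wall-clock milliseconds; this is the component that can be skewed.
    return int(time.time() * 1000)


@dataclass(order=True, frozen=True)
class HLCTimestamp:
    wall_ms: int  # highest physical time observed so far
    logical: int  # counter that breaks ties when wall clocks stall or collide


class HybridLogicalClock:
    """Per-node clock whose timestamps respect causality: a write observed
    before another can never end up with a larger timestamp, even if the
    local wall clock is seconds behind."""

    def __init__(self) -> None:
        self.wall_ms = 0
        self.logical = 0

    def now(self) -> HLCTimestamp:
        # Local event (e.g. a write accepted by this leader).
        pt = physical_now_ms()
        if pt > self.wall_ms:
            self.wall_ms, self.logical = pt, 0
        else:
            self.logical += 1
        return HLCTimestamp(self.wall_ms, self.logical)

    def observe(self, remote: HLCTimestamp) -> HLCTimestamp:
        # Merge a timestamp carried on a replicated write; the local clock
        # jumps forward so later local writes sort after what was observed.
        pt = physical_now_ms()
        prev = self.wall_ms
        self.wall_ms = max(prev, remote.wall_ms, pt)
        if self.wall_ms == prev and self.wall_ms == remote.wall_ms:
            self.logical = max(self.logical, remote.logical) + 1
        elif self.wall_ms == prev:
            self.logical += 1
        elif self.wall_ms == remote.wall_ms:
            self.logical = remote.logical + 1
        else:
            self.logical = 0
        return HLCTimestamp(self.wall_ms, self.logical)
```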
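For hot counters, the merge-on-read mitigation described above can be sketched as follows. A plain dict stands in for the real store, and the `#shardN` suffix is an illustrative subkey scheme; the shard count of 100 matches the celebrity-tweet example later in this section.

```python
import random

NUM_SHARDS = 100  # one hot counter split across 100 subkeys


def increment(store: dict, key: str, delta: int = 1) -> None:
    # Spread writers across subkeys so no single item takes every update;
    # each subkey sees ~1/NUM_SHARDS of the write rate and conflicts far less.
    subkey = f"{key}#shard{random.randrange(NUM_SHARDS)}"
    store[subkey] = store.get(subkey, 0) + delta


def read(store: dict, key: str) -> int:
    # Merge on read: the logical value is the sum over all subkeys.
    return sum(store.get(f"{key}#shard{i}", 0) for i in range(NUM_SHARDS))


# Usage: 1000 increments land on ~100 different subkeys instead of one.
store: dict = {}
for _ in range(1000):
    increment(store, "hashtag:trending")
assert read(store, "hashtag:trending") == 1000
```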
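The disjoint ID-space pattern for global uniqueness is little more than per-Region range bookkeeping. The ranges below are the ones from the text; everything else (the class, the in-memory cursor) is illustrative, and a real deployment would persist the cursor durably.

```python
# Ranges from the text; a real deployment would provision these from config.
ID_RANGES = {
    "tokyo":  (1_000_000, 1_999_999),
    "london": (2_000_000, 2_999_999),
}


class RegionIdAllocator:
    """Each leader mints IDs only from its own disjoint block, so two Regions
    can never allocate the same ID even while partitioned from each other."""

    def __init__(self, region: str) -> None:
        self.low, self.high = ID_RANGES[region]
        self.next_id = self.low

    def allocate(self) -> int:
        if self.next_id > self.high:
            raise RuntimeError("ID range exhausted; provision a new block")
        allocated = self.next_id
        self.next_id += 1
        return allocated


# Usage: the two leaders allocate concurrently without any coordination.
tokyo, london = RegionIdAllocator("tokyo"), RegionIdAllocator("london")
assert tokyo.allocate() == 1_000_000 and london.allocate() == 2_000_000
```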
💡 Key Takeaways
Clock skew of even 1 to 2 seconds breaks last-writer-wins (LWW) by letting stale writes with skewed timestamps permanently overwrite newer logical updates; hybrid logical clocks (HLC) or per-origin sequence numbers are required for correctness
Hot keys updated concurrently in multiple Regions generate conflicts faster than the cross-Region round-trip time (RTT) allows resolution; at 100ms RTT and a 10ms update interval, one key produces 10 conflicts per RTT, causing replication lag spikes
Network partitions create split brain with divergent write histories; a 5-minute partition accepting 1,000 writes per side creates 2,000 operations that must merge on healing, risking duplicate side effects if operations are not idempotent
Global uniqueness constraints (usernames, order IDs) require coordination or partitioning: allocate disjoint ID ranges per leader (Tokyo 1M to 1.99M, London 2M to 2.99M) or route uniqueness checks to a single authority Region, sacrificing availability
Tombstones for deletes must survive longer than the maximum replication lag; garbage-collecting them prematurely (say after 1 hour) before a delayed replica (lagging 2 hours) catches up causes deleted data to reappear during anti-entropy repair
Read-after-write consistency across Regions fails without session stickiness or read tokens: a user who writes in us-east-1 and immediately reads from eu-west-1 before replication completes (sub-second typically, but not guaranteed) sees stale data; a read-token sketch follows this list
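A minimal sketch of the read-token idea from the last takeaway: the write path returns a version token, and a replica in another Region refuses to serve the read until it has applied at least that version. `Replica` and `applied_version` are stand-ins for a real replication-log position or HLC timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    region: str
    applied_version: int = 0                 # highest replicated version applied
    data: dict = field(default_factory=dict)


def write(home: Replica, key: str, value: str) -> int:
    # The write path bumps the version and hands it back as the read token.
    home.applied_version += 1
    home.data[key] = value
    return home.applied_version


def read(replica: Replica, key: str, read_token: int):
    # Serve only if this replica has caught up to the client's last write;
    # otherwise the client retries here or falls back to the home Region.
    if replica.applied_version < read_token:
        raise RuntimeError(f"{replica.region} has only applied version "
                           f"{replica.applied_version}, need {read_token}")
    return replica.data.get(key)
```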
📌 Examples
Production clock skew incident: a Tokyo EC2 instance clock drifted +3 seconds due to a hypervisor issue; 10 minutes of London profile updates (600 operations) were permanently lost to concurrent Tokyo writes until monitoring detected the anomalous conflict rate and operations disabled the Tokyo leader
Celebrity tweet hot key: a trending hashtag counter was updated 50 times per second across 3 Regions with 100ms inter-Region RTT; conflict metadata per key grew to 5 KB over 10 minutes as the vector clock tracked 30,000 concurrent increments; resolved by sharding the counter into 100 subkeys and summing on read
DynamoDB Global Tables uniqueness: the application enforces unique email addresses by using email as the partition key and a conditional PutItem with an attribute_not_exists check; concurrent registrations in two Regions can both succeed locally because the condition is evaluated only against the local replica, and replication then reconciles via last-writer-wins, so the application needs follow-up verification or a single authority Region to detect the collision (see the conditional-write sketch after this list)
Amazon Dynamo retail cart side effect: an inventory decrement and confirmation email were triggered on cart checkout; a split-brain partition caused duplicate decrements and emails; fixed by assigning unique checkout operation IDs and deduplicating in the inventory service within a 10-minute window (see the dedup sketch after this list)
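For the Global Tables uniqueness example, the local conditional write looks roughly like the boto3 sketch below, assuming a table named users with email as its partition key. Note that the condition is evaluated only in the Region that receives the write, so on its own it does not provide cross-Region uniqueness.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")  # hypothetical table, partition key "email"


def register(email: str, display_name: str) -> bool:
    """Claim an email address in the local Region; False means it is already
    taken as far as this Region's replica can see."""
    try:
        users.put_item(
            Item={"email": email, "display_name": display_name},
            # Reject the write if an item with this email already exists locally.
            ConditionExpression="attribute_not_exists(email)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```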
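The checkout deduplication from the last example can be sketched as an operation-ID check inside a bounded window; the in-memory dict is a stand-in for a durable dedup table with a TTL, and the 10-minute window matches the example.

```python
import time

DEDUP_WINDOW_SECONDS = 600     # the 10-minute window from the example
_seen: dict[str, float] = {}   # op_id -> time the operation was first processed


def process_checkout(op_id: str, decrement_inventory, send_confirmation) -> bool:
    """Apply checkout side effects at most once per operation ID within the
    dedup window; replays (e.g. after a partition heals) become no-ops."""
    now = time.time()
    # Evict entries older than the window so the dedup store stays bounded.
    for old_id in [o for o, ts in _seen.items() if now - ts > DEDUP_WINDOW_SECONDS]:
        del _seen[old_id]
    if op_id in _seen:
        return False               # duplicate delivery, skip side effects
    _seen[op_id] = now
    decrement_inventory()
    send_confirmation()
    return True
```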