Failure Modes and Edge Cases in Production

Hot Partition Problem
Hot partitions occur when skewed traffic concentrates on a single partition. Throughput for the affected keys is then capped at roughly 1,000-5,000 operations per second (the exact ceiling depends on the system), and tail latency can climb from about 10 ms to 500 ms or more. Common causes are celebrity profiles, trending products, and monotonic keys such as timestamps, which send all recent writes to the same partition.
Common Pitfall: Using auto-increment IDs or timestamps as partition keys concentrates all recent writes on one partition. Use hashed keys or add random suffixes.
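The two fixes named above, hashed keys and random suffixes, can be sketched as follows. This is a minimal in-memory illustration, not any particular database's API: `hashed_partition`, `sharded_key`, `ShardedCounter`, and the shard count of 8 are all hypothetical names and values chosen for the example.

```python
import hashlib
import random
from collections import defaultdict

NUM_SHARDS = 8  # assumed number of sub-keys per hot key; tune to the write rate


def hashed_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash so sequential keys scatter."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def sharded_key(hot_key: str, num_shards: int = NUM_SHARDS) -> str:
    """Append a random suffix so writes to one hot key spread over N sub-keys."""
    return f"{hot_key}#{random.randrange(num_shards)}"


class ShardedCounter:
    """In-memory stand-in for a key-value store holding per-shard counts."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.num_shards = num_shards
        self.store = defaultdict(int)  # sub-key -> count

    def increment(self, hot_key: str, amount: int = 1) -> None:
        # Each write lands on one randomly chosen sub-key.
        self.store[sharded_key(hot_key, self.num_shards)] += amount

    def total(self, hot_key: str) -> int:
        # Aggregate on read: sum every sub-key of the logical key.
        return sum(
            self.store.get(f"{hot_key}#{i}", 0) for i in range(self.num_shards)
        )
```

The trade-off is that reads become more expensive: `total` must touch every sub-key, so the shard count should be only as large as the write rate requires.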
Eventual Consistency Anomalies
Lost read-your-writes occurs when a user updates data and then immediately reads from a replica that has not yet received the update, so the user sees the old value. Fix: route the user's reads to the same replica that handled the write (session consistency), or attach the write's timestamp to the read and wait until the replica has caught up to it.
Duplicate unique values occur when two concurrent writes both create the same username because uniqueness is not globally coordinated. Fix: use conditional writes with version checks (compare-and-set), or a centralized allocator service for critical unique values.
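Both fixes can be illustrated with a small single-process simulation. Everything here is hypothetical (`Node`, `compare_and_set`, `replicate`, `session_read`, and the `log_position` counter that stands in for a write timestamp). Note one substitution: instead of waiting for the replica to catch up, this sketch falls back to the primary when the replica is behind the session's last write.

```python
from typing import Optional


class Node:
    """One copy of the data; stands in for a primary or a replica."""

    def __init__(self) -> None:
        self.data = {}  # key -> (value, per-key version)
        self.log_position = 0  # number of writes this node has applied


def compare_and_set(primary: Node, key: str, value: str, expected_version: int) -> bool:
    """Apply the write only if the key's version still matches.

    expected_version 0 means "the key must not exist yet". This demo is
    single-threaded; a real store must do the check and write atomically.
    """
    _, current_version = primary.data.get(key, ("", 0))
    if current_version != expected_version:
        return False  # a concurrent writer got there first
    primary.data[key] = (value, current_version + 1)
    primary.log_position += 1
    return True


def replicate(primary: Node, replica: Node) -> None:
    """Simulate the replica catching up with the primary."""
    replica.data = dict(primary.data)
    replica.log_position = primary.log_position


def session_read(key: str, session_position: int, replica: Node, primary: Node) -> Optional[str]:
    """Serve from the replica only if it has applied the session's last write."""
    source = replica if replica.log_position >= session_position else primary
    record = source.data.get(key)
    return record[0] if record else None
```

A client would claim a username with `compare_and_set(primary, "username:alice", "user-1", 0)`, remember `primary.log_position` as its session position, and pass that position to every later `session_read`. A second client making the same claim with expected version 0 gets `False` back rather than silently creating a duplicate.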
Relational Failure Modes
Long transactions cause lock contention: they hold their locks until they finish, so other operations queue behind the same rows and throughput degrades. Deadlocks occur when two transactions each hold locks the other needs. Replication lag causes replicas to serve stale reads, especially under write-heavy load. Fixes: keep transactions short, use proper indexes to reduce lock scope, and monitor replication lag so that stale replicas can be fenced off from serving reads.
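Fencing stale replicas can be sketched as a routing decision based on measured lag. This is an illustrative sketch only: `Replica`, `eligible_replicas`, `choose_read_target`, and the 2-second budget are assumptions, and the lag values are taken as given rather than read from any real database's monitoring view.

```python
from dataclasses import dataclass
from typing import List

MAX_LAG_SECONDS = 2.0  # assumed staleness budget; pick one that fits the workload


@dataclass
class Replica:
    name: str
    lag_seconds: float  # assumed to be supplied by the database's own lag metric


def eligible_replicas(replicas: List[Replica], max_lag: float = MAX_LAG_SECONDS) -> List[Replica]:
    """Keep only replicas whose lag is within the staleness budget."""
    return [r for r in replicas if r.lag_seconds <= max_lag]


def choose_read_target(replicas: List[Replica], max_lag: float = MAX_LAG_SECONDS) -> str:
    """Pick the least-lagged eligible replica, or the primary if all are fenced."""
    healthy = eligible_replicas(replicas, max_lag)
    if not healthy:
        return "primary"
    return min(healthy, key=lambda r: r.lag_seconds).name
```

Falling back to the primary keeps reads fresh but shifts load onto it, so an alert on the number of fenced replicas is a sensible companion to this check.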

💡 Key Takeaways

✓ Hot partitions cap per-key throughput at 1K-5K ops/sec; fix by hashing partition keys, adding random suffixes, or write sharding

✓ Lost read-your-writes: the user reads stale data from a replica that has not yet received the write; fix with session consistency or read-after-write routing

✓ Duplicate unique values in leaderless systems: use conditional writes (compare-and-set) or a centralized allocator to enforce uniqueness

✓ Long transactions cause lock contention and deadlocks; keep transactions short and use proper indexes

📌 Interview Tips

1. When designing for celebrity/trending scenarios, explain write sharding: spread updates across N partition keys, aggregate on read

2. For unique constraint questions, describe compare-and-set: write succeeds only if current version matches expected, preventing concurrent duplicates

3. Mention replication lag monitoring: track how far replicas are behind and fence stale ones from serving reads
