Database Selection Failure Modes and Mitigation Strategies
Hot partitions emerge when workload skew concentrates traffic on a single shard, violating the uniform distribution assumption of most sharding schemes. A celebrity user posting on a social platform can route 80% of read traffic to one partition, inflating p99 latencies from 10 milliseconds to 500+ milliseconds as the partition's queue depth grows. Meta observed this with high-follower-count accounts in its social graph: a single celebrity post triggering notifications to 100 million followers saturated one TAO cache node. Mitigation requires key salting (appending random suffixes to distribute hot keys), adaptive load balancing that detects and routes around hot shards, or auto-splitting partitions when per-partition QPS exceeds thresholds like 50K requests per second.
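As a minimal sketch of key salting, assuming a generic key-value client with hypothetical `append`/`read` methods: writes scatter a hot key across a fixed number of salted sub-keys, and reads fan out across all of them and merge the results.

```python
import random

SALT_BUCKETS = 16  # number of sub-keys a hot key is spread across (tuning assumption)


def salted_key(base_key: str, salt: int) -> str:
    """Derive a salted partition key, e.g. 'feed:celebrity_123#07'."""
    return f"{base_key}#{salt:02d}"


def write_entry(kv, base_key: str, entry: dict) -> None:
    """Each write picks a random salt, so one hot key's traffic lands on 16 partitions instead of 1."""
    kv.append(salted_key(base_key, random.randrange(SALT_BUCKETS)), entry)


def read_entries(kv, base_key: str) -> list:
    """Reads pay the salting tax: fan out to every salted sub-key and merge."""
    merged = []
    for salt in range(SALT_BUCKETS):
        merged.extend(kv.read(salted_key(base_key, salt)))
    return merged
```

The trade-off is visible in the read path: salting multiplies read fan-out by the bucket count, so it only pays off for keys whose write or read concentration is actually saturating a partition.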
Replication lag under write spikes causes read-after-write anomalies that break user expectations. During an Amazon product launch generating 10x normal write traffic, asynchronous replication fell 8 seconds behind. Users who added items to their cart and immediately refreshed saw empty carts because reads hit a lagging replica. The solution combines monitoring lag via log sequence numbers or replication offsets, enforcing read-your-writes semantics by routing user sessions to the same replica or to the primary for a window after writes, and bounded-staleness guarantees that reject reads from replicas lagging more than 5 seconds.
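One way to enforce read-your-writes and bounded staleness at the routing layer, sketched with hypothetical `primary`/`replica` handles that expose a `replication_lag_seconds()` probe; the 30-second pin window and 5-second staleness bound are taken from the figures in this section, not from any particular product.

```python
import time

PIN_TO_PRIMARY_SECONDS = 30   # route a session's reads to the primary for this long after a write
MAX_REPLICA_LAG_SECONDS = 5   # bounded staleness: skip replicas lagging more than this

_last_write_at: dict[str, float] = {}  # session_id -> monotonic timestamp of most recent write


def record_write(session_id: str) -> None:
    _last_write_at[session_id] = time.monotonic()


def choose_read_endpoint(session_id: str, primary, replicas):
    """Pick where a read goes: the primary right after a write, otherwise the freshest fresh-enough replica."""
    last_write = _last_write_at.get(session_id)
    if last_write is not None and time.monotonic() - last_write < PIN_TO_PRIMARY_SECONDS:
        return primary
    fresh = [r for r in replicas if r.replication_lag_seconds() <= MAX_REPLICA_LAG_SECONDS]
    if not fresh:
        return primary  # every replica is too stale; fall back to the primary
    return min(fresh, key=lambda r: r.replication_lag_seconds())
```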
Cache stampede, or thundering herd, occurs when a popular cache entry expires and thousands of concurrent requests simultaneously query the backing database. A Reddit thread going viral caused 50K requests per second to hit PostgreSQL when the cache entry expired, spiking CPU to 100% and p99 latency to 5 seconds for 30 seconds until the cache warmed. Request coalescing collapses concurrent requests for the same key into a single backend query. Stale-while-revalidate serves the expired cache entry while asynchronously refreshing it in the background. Probabilistic early expiration refreshes entries slightly before expiration based on load and time-to-live, spreading refresh traffic.
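A minimal request-coalescing (single-flight) sketch using Python threads, assuming hypothetical `cache.get`/`cache.set` and `db.query` interfaces: the first thread to miss becomes the leader and refreshes the entry, while concurrent misses for the same key wait on it instead of hitting the database.

```python
import threading

_inflight: dict[str, threading.Event] = {}
_inflight_lock = threading.Lock()


def get_with_coalescing(cache, db, key: str, ttl_seconds: int = 300):
    """Collapse concurrent cache misses for one key into a single backend query."""
    value = cache.get(key)
    if value is not None:
        return value

    with _inflight_lock:
        event = _inflight.get(key)
        is_leader = event is None
        if is_leader:
            event = threading.Event()
            _inflight[key] = event

    if is_leader:
        try:
            value = db.query(key)               # the only backend query for this stampede
            cache.set(key, value, ttl_seconds)
        finally:
            with _inflight_lock:
                _inflight.pop(key, None)
            event.set()                          # release the waiting followers
        return value

    event.wait()                                 # followers block until the leader finishes
    return cache.get(key) or db.query(key)       # fall back only if the leader's refresh failed
```

Stale-while-revalidate composes with this: followers could return the expired value immediately while the leader refreshes, trading a few seconds of staleness for flat backend load.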
Distributed transaction deadlocks manifest when two transactions acquire locks in different orders across partitions. Under high contention on inventory records during a flash sale, Google Cloud Spanner experienced transaction abort rates reaching 5% as concurrent transactions updated the same popular products, and commit latency for successful transactions jumped from 50 milliseconds to 200+ milliseconds due to retries. Optimistic concurrency control with versioning avoids locks: read the entity's version, apply changes, and commit only if the version has not changed. Reducing transaction scope to single-partition updates eliminates cross-partition coordination. Exponential backoff with jitter on retries prevents retry storms.
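A sketch of that optimistic path, assuming a hypothetical store exposing `read` (returning current stock and version) and a conditional `compare_and_set` that raises `VersionConflict` when the version has moved; retries use exponential backoff with full jitter.

```python
import random
import time

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 0.05  # first backoff step, roughly one commit latency


class VersionConflict(Exception):
    """Raised by the (hypothetical) store when the expected version no longer matches."""


def decrement_inventory(store, product_id: str, quantity: int) -> None:
    """Read the version, compute the change, and commit only if the version is unchanged."""
    for attempt in range(MAX_RETRIES):
        row = store.read(product_id)               # e.g. {"stock": 40, "version": 7}
        remaining = row["stock"] - quantity
        if remaining < 0:
            raise ValueError("insufficient inventory")
        try:
            store.compare_and_set(
                product_id,
                expected_version=row["version"],
                new_value={"stock": remaining, "version": row["version"] + 1},
            )
            return
        except VersionConflict:
            # Full jitter: sleep a random amount up to an exponentially growing cap,
            # so colliding transactions do not retry in lockstep.
            time.sleep(random.uniform(0, BASE_DELAY_SECONDS * (2 ** attempt)))
    raise RuntimeError(f"gave up updating {product_id} after {MAX_RETRIES} conflicts")
```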
💡 Key Takeaways
• Hot partitions from skewed access (celebrity users, flash sales) concentrate 80% of traffic on one shard, inflating p99 latency from 10 milliseconds to 500+ milliseconds; mitigate with key salting, adaptive load balancing, or auto-splitting at 50K QPS thresholds
• Replication lag during write spikes (10x normal traffic) can reach 5 to 10 seconds, so users see stale data after their own writes; enforce read-your-writes by routing to the primary or the same replica, and reject reads from replicas lagging over 5 seconds
• Cache stampede when popular entries expire triggers 50K+ simultaneous backend requests, spiking database CPU to 100% and p99 latency to 5 seconds; use request coalescing, stale-while-revalidate, or probabilistic early expiration
• Distributed transaction deadlocks under high contention cause 5% abort rates and commit latency jumps from 50 milliseconds to 200+ milliseconds; use optimistic concurrency with versioning, reduce transaction scope to single partitions, and apply exponential backoff with jitter
• Compaction and garbage-collection stalls in LSM-based stores cause read amplification and latency spikes to 500+ milliseconds during write bursts; rate-limit ingestion (a token-bucket sketch follows this list), isolate compaction IO, and monitor compaction backlog depth
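A minimal token-bucket sketch for the ingestion rate limit mentioned in the last takeaway; the rates, burst size, and backlog threshold are illustrative assumptions, and `apply_write` stands in for whatever downstream write path the store uses.

```python
import time


class TokenBucket:
    """Token bucket for capping write ingestion so LSM compaction can keep pace."""

    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second      # sustained writes per second
        self.capacity = burst            # maximum burst allowance
        self.tokens = burst
        self.last_refill = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


limiter = TokenBucket(rate_per_second=50_000, burst=5_000)  # assumed limits


def ingest(apply_write, record, compaction_backlog_depth: int) -> None:
    """Tighten the rate while the compaction backlog is deep; reject writes once the bucket drains."""
    limiter.rate = 25_000 if compaction_backlog_depth > 100 else 50_000  # assumed threshold
    if not limiter.try_acquire():
        raise RuntimeError("ingestion throttled; caller should retry with backoff")
    apply_write(record)
```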
📌 Examples
Reddit cache stampede: a viral thread sent 50K requests per second to PostgreSQL when the cache entry expired; CPU spiked to 100% and p99 latency reached 5 seconds for 30 seconds; implementing stale-while-revalidate reduced stampede requests by 99% and kept p99 under 50 milliseconds
Amazon shopping cart replication lag: a product launch generated 10x write traffic and async replication lagged 8 seconds, so users saw empty carts after adding items; implementing session affinity to route users to the same replica for 30 seconds after writes reduced the anomaly rate from 2% to 0.01%