Failure Modes in Read and Write Optimized Systems
Production systems face distinct failure modes depending on optimization bias. Understanding these edge cases and implementing mitigations separates resilient architectures from fragile ones.
Cache stampede occurs when a popular cache key expires and thousands of concurrent requests simultaneously hit the origin database. A single cache miss can generate 10,000+ origin queries in milliseconds, overwhelming the database and spiking p99 latency from 10ms to multiple seconds. Mitigations include probabilistic early expiration (refresh before the TTL elapses), request coalescing (deduplicate concurrent fetches for the same key), and stale-while-revalidate (serve expired data while refreshing in the background). Twitter actively manages this for celebrity tweets that would otherwise cause thundering herds.
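Request coalescing is the mitigation most amenable to a short sketch. The class below (names are illustrative; any origin fetch function works) lets the first miss for a key own the origin fetch while every concurrent caller for the same key waits on one shared in-flight future:

```python
import threading
from concurrent.futures import Future

class RequestCoalescer:
    """Deduplicates concurrent fetches: the first caller for a key performs
    the origin fetch; concurrent callers block on the same Future."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn      # origin fetch, e.g. a database query
        self._lock = threading.Lock()
        self._inflight = {}         # key -> Future shared by waiting callers
        self.origin_calls = 0       # observability: how often we hit origin

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            owner = fut is None
            if owner:               # first miss for this key: we own the fetch
                fut = Future()
                self._inflight[key] = fut
        if owner:
            try:
                self.origin_calls += 1
                fut.set_result(self._fetch(key))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:    # allow future misses to re-fetch
                    self._inflight.pop(key, None)
        return fut.result()         # waiters block here until the owner finishes
```

With 10,000 concurrent misses for one key, the origin sees a single query instead of 10,000; the trade-off is that all waiters share the owner's latency (and its failure, which is why the exception is propagated to every waiter).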
Hot partitions emerge when poor key distribution causes one shard to absorb disproportionate traffic. A celebrity user update can route 80% of writes to a single shard while others sit idle, causing p99 latency on that shard to jump from 10ms to 500ms or more. Twitter's hybrid fanout strategy addresses this: high-follower accounts trigger fanout-on-read to avoid writing one tweet to 30 million inboxes. Additional mitigations include load-aware partitioning, sharded counters for viral content, and dynamic shard splitting.
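A sharded counter is simple enough to sketch. Assuming a dict-like key-value store as a stand-in for whatever partitioned store is in use, each increment lands on a random sub-key, so no single partition absorbs every write for a viral post; reads pay a small fan-in cost by summing the shards:

```python
import random

class ShardedCounter:
    """Spreads a hot counter (e.g. a viral post's like count) across N
    sub-keys so no single partition absorbs every increment."""

    def __init__(self, store, key, shards=16):
        self.store = store      # dict-like KV store (illustrative stand-in)
        self.key = key
        self.shards = shards

    def incr(self, amount=1):
        # A random shard choice spreads write load ~evenly across partitions.
        shard_key = f"{self.key}:{random.randrange(self.shards)}"
        self.store[shard_key] = self.store.get(shard_key, 0) + amount

    def value(self):
        # Reads fan in: sum every shard's partial count.
        return sum(self.store.get(f"{self.key}:{i}", 0)
                   for i in range(self.shards))
```

The shard count trades write spread against read cost: more shards means less contention per partition but more keys to sum on read.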
Replication lag and read-after-write anomalies plague read-heavy systems with asynchronous replication. A user posts content and immediately queries for it, but the read hits a replica that is 10 seconds behind, so the post appears missing. This is particularly problematic for user-facing features. Solutions include session stickiness (route the author's reads to the primary for T seconds), monotonic reads (track the log sequence number, or LSN, and only read from replicas that have advanced past it), or exposing the inconsistency in the user experience (UX) with "your post is processing" messaging.
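The LSN-tracking approach can be sketched in a few lines. This is a minimal model, assuming each node exposes the LSN it has replayed (node and field names here are illustrative, not any particular database's API):

```python
class ReplicaRouter:
    """Read-your-writes routing sketch: after a write, remember the
    primary's log sequence number (LSN) for that session and only serve
    the session's reads from replicas that have replayed at least that far."""

    def __init__(self, primary, replicas):
        self.primary = primary      # each node: {"lsn": int, "data": {...}}
        self.replicas = replicas
        self.session_lsn = {}       # session_id -> minimum LSN it must see

    def write(self, session_id, key, value):
        self.primary["lsn"] += 1
        self.primary["data"][key] = value
        # Pin the session to this point in the log.
        self.session_lsn[session_id] = self.primary["lsn"]

    def read(self, session_id, key):
        need = self.session_lsn.get(session_id, 0)
        for replica in self.replicas:
            if replica["lsn"] >= need:      # caught up past the session's LSN
                return replica["data"].get(key)
        # No replica is fresh enough: fall back to the primary.
        return self.primary["data"].get(key)
```

The author who just posted falls back to the primary until some replica catches up, while sessions that never wrote can read any replica, keeping most read traffic off the primary.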
Fanout write storms happen when high-follower accounts trigger massive write amplification. A single tweet to 30 million followers without mitigation becomes 30 million individual writes, potentially saturating storage Input/Output Operations Per Second (IOPS), queue capacity, and network bandwidth. Systems must implement selective fanout strategies, write coalescing, batch compression, and priority queuing with backpressure to survive viral events.
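The selective-fanout decision can be sketched as follows. The threshold and data structures are illustrative, not Twitter's actual values: normal accounts push into follower inboxes at write time, while high-follower accounts write once to a pull index that readers merge in at timeline-build time:

```python
FANOUT_LIMIT = 10_000  # assumed cutoff; real systems tune this empirically

class Timeline:
    """Hybrid fanout sketch: push tweets into follower inboxes for normal
    accounts; store high-follower tweets once and merge them on read."""

    def __init__(self):
        self.inboxes = {}       # user -> list of (seq, tweet)
        self.pull_index = {}    # high-follower author -> list of (seq, tweet)
        self.followers = {}     # author -> set of follower ids
        self.seq = 0            # global ordering for the sketch

    def post(self, author, tweet):
        self.seq += 1
        fans = self.followers.get(author, set())
        if len(fans) < FANOUT_LIMIT:
            # Fanout-on-write: O(followers) inbox writes (batched in practice).
            for f in fans:
                self.inboxes.setdefault(f, []).append((self.seq, tweet))
        else:
            # Fanout-on-read: one write regardless of follower count.
            self.pull_index.setdefault(author, []).append((self.seq, tweet))

    def read(self, user, follows):
        merged = list(self.inboxes.get(user, []))
        for author in follows:  # merge high-follower tweets at read time
            merged += self.pull_index.get(author, [])
        return [t for _, t in sorted(merged, reverse=True)]  # newest first
```

The write-time saving is paid for on the read path, which is why the read result is typically cached, as in the Twitter example below.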
💡 Key Takeaways
•Cache stampede from popular-key expiration generates 10K+ concurrent origin requests in milliseconds, spiking p99 from 10ms to seconds: mitigate with request coalescing and stale-while-revalidate
•Hot partitions route 80% of traffic to one shard when celebrity or viral content hits: Twitter uses hybrid fanout, avoiding 30M writes by computing high-follower timelines on read
•Replication lag causes read-after-write anomalies when async replicas fall 10+ seconds behind: route authors to the primary or track LSNs to enforce monotonic reads
•Fanout write storms amplify a single viral event into millions of writes: require selective fanout, batch compression, and priority queues with backpressure
•Multi-master conflicts arise from clock skew and concurrent updates: last-write-wins with unsynchronized clocks silently drops updates, requiring logical clocks or conflict-free replicated data types (CRDTs)
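The multi-master takeaway is easy to demonstrate: under last-write-wins, a slow clock on one master silently discards the physically later update, while version vectors at least detect that two writes were concurrent. Timestamps, node names, and values below are invented for illustration:

```python
def lww_merge(a, b):
    # Last-write-wins on wall-clock timestamps: entries are (timestamp, value).
    return a if a[0] >= b[0] else b

def is_concurrent(vc_a, vc_b):
    # Version vectors: a conflict exists when neither update dominates.
    a_ahead = any(vc_a.get(k, 0) > vc_b.get(k, 0) for k in vc_a)
    b_ahead = any(vc_b.get(k, 0) > vc_a.get(k, 0) for k in vc_b)
    return a_ahead and b_ahead

# Master B applies the physically *later* update, but its clock runs
# 5 seconds slow, so LWW keeps the stale value written on master A.
first  = (1_700_000_000,     "bio: engineer")   # applied first, on A
second = (1_700_000_000 - 5, "bio: manager")    # applied second, on B
assert lww_merge(first, second)[1] == "bio: engineer"  # later update dropped

# Version vectors flag the same pair of writes as concurrent instead of
# silently picking one, letting the application merge or surface both.
assert is_concurrent({"A": 1}, {"B": 1})                    # true conflict
assert not is_concurrent({"A": 2, "B": 1}, {"A": 1, "B": 1})  # A dominates
```

Detection is only half the story: once a conflict is flagged, the system still needs a merge policy, which is what CRDTs provide automatically for supported data types.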
📌 Examples
Twitter celebrity tweet: 30M followers would cause a fanout storm; instead the system computes the timeline on read for high-follower accounts and caches the result
Meta TAO employs request coalescing for cache misses: multiple concurrent requests for the same key are deduplicated to a single origin fetch, preventing stampedes
LinkedIn feed consumers: during traffic spikes, consumer lag can grow to minutes or hours, requiring autoscaling and monitoring of partition-lag metrics