Caching • Cache Stampede Problem
What is Cache Stampede and Why Does It Happen?
Cache stampede, also known as the dogpile effect, occurs when many clients simultaneously experience a cache miss on the same key and all attempt to fetch or recompute the backing data at once. This typically happens when a Time To Live (TTL) expires, during mass invalidation events, or after a cache cold start (such as during server restarts). The simultaneous rush of requests completely negates the benefit of caching and can overwhelm the origin database or service, causing severe latency spikes and elevated error rates.
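To make the failure mode concrete, here is a minimal sketch of the naive cache-aside read path that is vulnerable to a stampede. It assumes a Redis cache accessed through the redis-py client; the key name and the query_database() helper are hypothetical stand-ins for an expensive origin call. Nothing in this path coordinates concurrent misses, so every request that arrives between expiry and the first refill recomputes the value.

```python
import json
import redis  # assumes the redis-py client; any shared key-value cache has the same shape

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300  # 5-minute TTL, matching the example below


def query_database(key):
    """Hypothetical stand-in for the expensive origin query or recomputation."""
    return {"posts": []}  # placeholder result


def get_trending(key="trending:homepage"):
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: origin is never touched

    # Cache miss: every caller that reaches this line fetches from the origin.
    # When the TTL expires on a hot key, thousands of concurrent requests can
    # land here at once -- that simultaneous recompute is the stampede.
    value = query_database(key)
    cache.setex(key, TTL_SECONDS, json.dumps(value))
    return value
```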
The severity of a stampede correlates directly with key popularity and TTL configuration. A single hot key serving 20,000 requests per second (RPS) with a 5-minute TTL can send roughly 20,000 requests to the origin in the second after it expires. During peak traffic hours, this surge can trigger system-wide brownouts or cascading failures. Facebook's 2010 outage, its worst in over four years, took the site down for roughly 2.5 hours after dogpiling behavior and retry storms amplified a cascading failure across backend components. The mathematical reality is stark: without mitigation, origin load spikes from near zero (during cached periods) to thousands of queries per second (QPS) in milliseconds, often exceeding database connection pool limits and overwhelming query execution threads.
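A rough back-of-envelope model of that burst (a sketch only: the request rate is the figure quoted above, and the recompute latency is an assumed parameter, since the herd keeps growing until the first successful refill lands in the cache):

```python
def stampede_burst(request_rate_rps: float, recompute_seconds: float) -> int:
    """Approximate number of requests that miss the cache and hit the origin
    between expiry and the first successful cache refill."""
    return int(request_rate_rps * recompute_seconds)


# 20,000 RPS hot key whose origin query completes in ~1 s:
print(stampede_burst(20_000, 1.0))  # 20000 concurrent origin calls
# If the overloaded query slows to 5 s, the backlog compounds before the refill lands:
print(stampede_burst(20_000, 5.0))  # 100000 queued origin calls
```

The longer the recompute takes under load, the larger the herd grows before the cache is refilled, which is why these incidents tend to amplify themselves.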
The fundamental problem is synchronization. When thousands of application servers all rely on identical TTL values, cache entries expire at exactly the same moment across the entire fleet. Every server detects a miss simultaneously and issues its own origin request. The origin system, designed to handle perhaps 100 QPS under normal load, suddenly receives 20,000 concurrent requests. Response times degrade from single-digit milliseconds to seconds, or requests time out entirely, creating user-visible latency spikes and, if writes are involved, potential data loss.
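The toy simulation below makes that fan-out visible (an in-process sketch, not a real cache client; the barrier simply models the shared instant at which every worker discovers the expired entry). Because nothing coordinates the refill, one expiry turns into roughly one origin call per worker.

```python
import threading
import time

N_WORKERS = 1000            # simulated application servers
ORIGIN_CALLS = 0
counter_lock = threading.Lock()
cache = {}                  # shared cache whose hot entry has just expired
barrier = threading.Barrier(N_WORKERS)


def fetch_from_origin(key):
    """Stand-in for the expensive database query (~100 ms)."""
    global ORIGIN_CALLS
    with counter_lock:
        ORIGIN_CALLS += 1
    time.sleep(0.1)
    return f"value-for-{key}"


def handle_request(key):
    barrier.wait()                      # all workers hit the expired entry at the same instant
    value = cache.get(key)
    if value is None:                   # identical expiry => identical miss on every node...
        value = fetch_from_origin(key)  # ...so every node re-queries the origin
        cache[key] = value
    return value


threads = [threading.Thread(target=handle_request, args=("trending:homepage",))
           for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"origin calls for a single key: {ORIGIN_CALLS}")  # close to 1,000, not 1
```

Run as written, the final count is roughly equal to the number of workers; mitigation techniques aim to collapse that count back toward one refill per expiry.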
💡 Key Takeaways
• Hot key at 20,000 RPS with a 5-minute TTL generates roughly 20,000 concurrent origin requests on expiry, spiking origin load from baseline to extreme levels instantly
• Facebook's 2010 outage lasted roughly 2.5 hours, with dogpiling and retry storms amplifying a cascading failure across backend infrastructure
• Synchronization is the root cause: identical TTL values across a distributed fleet cause simultaneous expiration and miss detection on all nodes
• Origin systems designed for 100 QPS steady state cannot absorb a 20,000 QPS burst without connection exhaustion, thread saturation, and timeout cascades
• Shorter TTLs increase stampede frequency while longer TTLs increase staleness; the more popular the key, the more catastrophic each stampede event
• Cold cache scenarios (restarts, deployments) trigger stampedes on all keys simultaneously rather than just one, multiplying the impact across the entire key space
📌 Examples
Production scenario: The homepage 'trending posts' key serves 50,000 RPS. Without mitigation, TTL expiry causes the database connection pool (sized for 200 connections) to receive 50,000 simultaneous queries, exhausting connections in under 100 ms and causing 30+ second timeouts for all requests.
Reddit during breaking news: A popular post's cache entry expires during a traffic spike from 10,000 to 100,000 concurrent users. All 100,000 users miss the cache simultaneously, overwhelming PostgreSQL with aggregation queries that normally take 200 ms but now queue for 45+ seconds.
E-commerce flash sale: A product inventory key at 15,000 RPS expires exactly when the sale starts. The database receives 15,000 concurrent SELECT FOR UPDATE queries, creating lock contention that blocks inventory updates and causes lost sales.