Cache Stampede and Hot Key Overload: Failure Modes at Scale
Cache stampede, also called the thundering herd problem, occurs when a popular key expires and thousands of concurrent requests simultaneously discover the cache miss and attempt to refetch from the backend database. At scale, this can spike origin queries per second (QPS) by 10,000 times or more within milliseconds, overwhelming the database and causing cascading failures. For example, if a celebrity profile cached at 50,000 reads per second expires at noon, and a cache lookup takes 1 ms while a database fetch takes 20 ms, then 50,000 requests per second × 0.020 s = 1,000 requests arrive during the 20 ms refill window; each misses the cache and independently queries the database, a 1,000 times amplification over the single query actually needed. Production systems use several mitigations:
•Request coalescing collapses multiple concurrent requests for the same key into one backend fetch (a minimal sketch follows this list).
•Lease-based refill grants one requester exclusive rights to refresh via a token while others wait or receive stale data.
•Soft TTL with background refresh serves slightly stale data while a background worker refreshes before the hard TTL expiry.
•Jittered TTLs randomize expiration times by plus or minus 10 to 20 percent to prevent synchronized expirations across many keys.
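Below is a minimal sketch of request coalescing (often called "singleflight") in Python. The Coalescer class, its fetch method, and load_fn are illustrative names, not any specific library's API: under concurrent misses for the same key, exactly one thread (the leader) queries the backend while the rest block and share its result.

```python
import threading

class Coalescer:
    """Collapses concurrent fetches of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, shared result holder)

    def fetch(self, key, load_fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        event, holder = entry
        if leader:
            try:
                holder["value"] = load_fn(key)  # the one backend query
            except Exception as exc:
                holder["error"] = exc  # propagate failures to waiters too
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()  # wake every waiting follower
        else:
            event.wait()  # followers block until the leader finishes
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

With this in place, 1,000 concurrent misses on the same key produce one database query instead of 1,000.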
Hot key overload is a related problem in which a small set of keys receives disproportionate traffic, saturating a single cache partition or node. In a typical distributed cache, consistent hashing maps each key to one primary node. If one key receives 100,000 requests per second while the average key receives 10, that single node must handle 10,000 times the traffic of its peers, quickly hitting CPU, network, or memory bandwidth limits. This is common in social networks (celebrity posts), e-commerce (flash sale items), and content platforms (viral videos). Mitigations include:
•Key replication, where the hot key is copied to multiple nodes and clients fan out reads across the replicas, spreading load horizontally (sketched below).
•Request-level rate limiting or admission control to cap per-key QPS and shed excess load.
•Key splitting, where semantics allow breaking one logical key into multiple physical keys with client-side aggregation.
Meta engineering talks describe replicating extremely hot keys across dozens of memcache servers and using mcrouter to distribute reads, while Netflix has discussed per-key throttling to protect cache nodes from single-key traffic spikes.
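A sketch of client-side key replication, assuming a cache client that exposes plain get and set methods; the REPLICAS count and the "#rN" suffix scheme are illustrative. Because each suffixed physical key hashes differently, consistent hashing places the replicas on different nodes, and random read fan-out divides the hot key's load by roughly the replica count.

```python
import random

REPLICAS = 10  # illustrative replica count for a known-hot key

def _replica_key(key, i):
    # Suffixing changes the hash, so each replica lands on a
    # different cache node under consistent hashing.
    return f"{key}#r{i}"

def write_hot_key(cache, key, value, ttl):
    # Write every replica so any one of them can serve a read.
    for i in range(REPLICAS):
        cache.set(_replica_key(key, i), value, ttl)

def read_hot_key(cache, key):
    # Each read picks a random replica, so per-node load drops to
    # roughly total_qps / REPLICAS.
    return cache.get(_replica_key(key, random.randrange(REPLICAS)))
```

The trade-off is write amplification: every update now touches REPLICAS nodes, which is why this is typically applied only to keys detected as hot.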
Both failure modes illustrate the importance of observability and guardrails at scale. Systems must track per-key request rates, per-node load distribution, cache miss storms (sudden spikes in miss rate), and backend query amplification. Automated circuit breakers should shed non-critical traffic or serve stale data when backend load exceeds safe thresholds (a minimal guard sketch follows). At Meta scale, even a 2 percent drop in hit rate due to stampedes or hot key evictions can translate into hundreds of thousands of extra database queries per second, so real-time alerting tied to origin load, rather than just hit ratio, is critical. Production runbooks often include pre-warming strategies (loading popular keys before TTL expiry) and gradual rollout of cache changes to detect issues before they affect full traffic.
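A minimal sketch of the serve-stale guardrail, assuming a plain dict cache of (value, expires_at) tuples; MAX_ORIGIN_QPS, the OriginLoad tracker, and guarded_get are illustrative assumptions, not a production system's API.

```python
import time

MAX_ORIGIN_QPS = 5_000  # assumed safe backend query budget

class OriginLoad:
    """Counts backend queries in the current one-second window."""

    def __init__(self):
        self.start, self.count = time.monotonic(), 0

    def record(self):
        if time.monotonic() - self.start >= 1.0:
            self.start, self.count = time.monotonic(), 0  # new window
        self.count += 1

    def qps(self):
        return 0 if time.monotonic() - self.start >= 1.0 else self.count

origin = OriginLoad()

def guarded_get(cache, key, load_fn, ttl=60):
    entry = cache.get(key)  # (value, expires_at) or None
    if entry and entry[1] > time.monotonic():
        return entry[0]  # fresh hit
    if entry and origin.qps() >= MAX_ORIGIN_QPS:
        return entry[0]  # backend over budget: stale beats overload
    origin.record()
    value = load_fn(key)  # the backend query counted above
    cache[key] = (value, time.monotonic() + ttl)
    return value
```

A real deployment would track load per backend shard and distinguish critical from sheddable traffic; this shows only the core decision.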
💡 Key Takeaways
•Cache stampede amplification scales with request rate times backend fetch latency: a key at 50,000 requests per second with a 20 ms database fetch accumulates 50,000 × 0.02 = 1,000 concurrent database queries during the refill window, spiking backend load by 1,000 times over the single query needed.
•Lease-based refill solves stampedes by granting one requester exclusive refresh rights via a token while others receive stale data or wait briefly. Meta memcache uses this pattern to protect MySQL from thundering herds on popular social graph keys (see the lease sketch in the Examples section).
•Soft TTL with background refresh decouples user latency from revalidation: serve slightly stale data (within seconds of expiry) while a background worker fetches fresh data before the hard TTL, so users never wait for slow backend queries (see the sketch after this list).
•Hot key overload saturates individual cache nodes when one key receives 10,000 times more traffic than average. Mitigations include replicating the hot key to multiple nodes and distributing reads, or rate limiting per-key QPS and shedding excess requests.
•At Meta scale, a 2 percent hit rate drop from stampedes or hot key evictions translates into hundreds of thousands of extra backend queries per second. Automated alerts tied to origin load rather than just hit ratio are critical for detecting issues.
•Jittered TTLs randomize expiration by plus or minus 10 to 20 percent to prevent synchronized expiration storms where thousands of keys expire simultaneously, spreading refill load over time instead of creating traffic spikes. The sketch after this list combines jitter with soft-TTL refresh.
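The following sketch combines the jittered-TTL and soft-TTL takeaways above, using a plain dict as the cache; SOFT_FRACTION, the base TTL, and the background thread are illustrative assumptions.

```python
import random
import threading
import time

SOFT_FRACTION = 0.8  # start background refresh at 80 percent of the TTL

def jittered_ttl(base_ttl, jitter=0.2):
    # Randomize plus or minus 20 percent so keys written together
    # do not all expire together.
    return base_ttl * random.uniform(1 - jitter, 1 + jitter)

def get_with_soft_ttl(cache, key, load_fn, base_ttl=300):
    now = time.monotonic()
    entry = cache.get(key)  # (value, stored_at, ttl) or None
    if entry is None or now - entry[1] > entry[2]:
        # Miss or hard expiry: the caller must wait for a fresh fetch.
        value = load_fn(key)
        cache[key] = (value, now, jittered_ttl(base_ttl))
        return value
    value, stored_at, ttl = entry
    if now - stored_at > SOFT_FRACTION * ttl:
        # Soft expiry: serve the slightly stale value immediately and
        # refresh in the background before the hard TTL is reached.
        def refresh():
            cache[key] = (load_fn(key), time.monotonic(),
                          jittered_ttl(base_ttl))
        threading.Thread(target=refresh, daemon=True).start()
    return value
```

In production the background refresh would itself be coalesced so that only one refresh runs per key at a time.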
📌 Examples
Meta memcache stampede protection: a viral celebrity post cached at 100,000 reads per second uses leases. When the key expires, memcache grants a single lease token to one request to refill from TAO while the remaining concurrent requests receive the stale cached version, preventing a spike of up to 100,000 queries per second against the MySQL backend. A minimal lease sketch follows below.
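A minimal lease sketch in the spirit of this pattern; LeaseCache, LEASE_TTL, and the token scheme are illustrative, not Meta's actual memcache interface. One caller receives a token and refills; everyone else gets the stale value.

```python
import threading
import time
import uuid

LEASE_TTL = 2.0  # seconds a refill lease stays exclusive (assumed)

class LeaseCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}    # key -> (value, expires_at)
        self._leases = {}  # key -> (token, granted_at)

    def get(self, key):
        """Return (value, lease_token). The token is non-None only for
        the single caller that should refill; others get stale data."""
        with self._lock:
            now = time.monotonic()
            entry = self._data.get(key)
            if entry and entry[1] > now:
                return entry[0], None  # fresh hit, no refill needed
            lease = self._leases.get(key)
            if lease and now - lease[1] < LEASE_TTL:
                # Another caller holds the lease: serve stale (or None).
                return (entry[0] if entry else None), None
            token = uuid.uuid4().hex
            self._leases[key] = (token, now)
            return (entry[0] if entry else None), token

    def set(self, key, value, ttl, token):
        with self._lock:
            if self._leases.get(key, (None, 0))[0] != token:
                return False  # lease expired or superseded; drop the write
            self._data[key] = (value, time.monotonic() + ttl)
            del self._leases[key]
            return True
```

A caller checks the returned token: if it is non-None, it fetches from the backend and calls set with the token; otherwise it uses the possibly stale value as-is.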
Netflix EVCache hot key replication: during a popular show release, the metadata key receives 500,000 requests per second. EVCache replicates the key across 10 nodes in each of 3 availability zones (30 replicas total) and uses client-side load balancing to distribute reads, reducing per-node load to roughly 16,700 requests per second.
E-commerce flash sale: a product page key normally serves 100 requests per second but spikes to 50,000 during a sale. Without coalescing, the cache miss at sale start could trigger up to 50,000 database queries in the first second. Request coalescing collapses them into a single fetch, then broadcasts the result to all waiting requests.