
Failure Modes: Lock Leakage, Cold Cache, and Hot Key Attacks

Production cache stampede mitigations fail in predictable ways that require explicit defenses. Lock leakage occurs when a lease holder crashes, hangs, or suffers a network partition after acquiring a per-key lock but before completing the refresh and releasing the lock. If the lock time-to-live (TTL) is set too conservatively (e.g., 5 seconds for a typical 200ms refresh), all subsequent requests block for up to 5 seconds waiting for a lock that the crashed holder will never release. If the lock TTL is too aggressive (e.g., 150ms for a 200ms P99 refresh), the lock expires before the refresh completes, allowing a second requester to acquire the lock and duplicate the work. The solution is fencing tokens: each lease carries a monotonic version number, and the cache accepts writes only from the current lease holder, rejecting late writes from expired leases. Additionally, implement backup-writer logic: followers wait out the lock TTL and then acquire a new lease if the original holder has failed, using jittered retry backoff to prevent thundering retry herds.

Cold cache scenarios after restarts, deployments, or failures create massive synchronized stampedes because all keys are missing simultaneously instead of expiring individually over time. A fleet restart brings up 1,000 empty cache instances; the first request wave misses on every single key, potentially generating millions of origin requests in seconds. Mitigation requires multi-layered defense. First, implement cache warming: before serving traffic, preload critical hot keys from the origin or a backup cache tier. Second, use progressive traffic ramping: gradually increase the request rate over 5 to 10 minutes rather than taking full traffic instantly. Third, deploy request shedding with quality-of-service (QoS) prioritization: under extreme origin load, drop low-priority requests (e.g., background analytics) while preserving high-priority user-facing requests. Fourth, combine with Stale-While-Revalidate (SWR): maintain a secondary persistent cache (Redis, disk) that survives restarts and serves as a stale fallback during cold start.

Hot key attacks, whether malicious or organic (viral content, breaking news), can overwhelm even well-designed systems. An attacker repeatedly requests a specific key with a low TTL or uses cache-busting parameters to force misses; a viral post on Reddit can spike from 100 RPS to 100,000 RPS in seconds. Defenses include per-key concurrency caps (e.g., at most 1 in-flight refresh per key regardless of request volume), token-bucket rate limiting at the origin for refresh operations (e.g., cap total refresh QPS at 1,000 even if 10,000 keys need refresh), and serving stale or default fallback content under extreme pressure. Implement hot key detection: track per-key request rates and automatically extend TTLs or switch to dedicated refresh workers for keys exceeding thresholds (e.g., above 10,000 RPS). For truly critical keys, use replication: cache multiple copies across nodes to increase hit parallelism without increasing origin load.
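To make the per-key lease and fencing-token check concrete, here is a minimal single-process sketch (Python). The names `FencedCache`, `try_acquire_lease`, and `put_if_current` are illustrative, not a specific library's API; in a distributed deployment the same "is this token still current?" check would typically be enforced by the shared cache itself via an atomic compare-and-set.

```python
import threading
import time


class FencedCache:
    """Toy in-process cache with per-key leases and monotonic fencing tokens."""

    def __init__(self, lock_ttl=0.8):
        self._mu = threading.Lock()
        self._data = {}            # key -> (value, value_expires_at)
        self._leases = {}          # key -> (fencing_token, lease_expires_at)
        self._next_token = 0       # monotonic fencing token counter
        self.lock_ttl = lock_ttl   # seconds; should exceed P99 refresh time

    def try_acquire_lease(self, key):
        """Return a fencing token if the caller may refresh `key`, else None."""
        with self._mu:
            now = time.monotonic()
            _token, lease_expires = self._leases.get(key, (None, 0.0))
            if lease_expires > now:
                return None        # another refresh is in flight
            self._next_token += 1
            self._leases[key] = (self._next_token, now + self.lock_ttl)
            return self._next_token

    def put_if_current(self, key, value, token, value_ttl):
        """Accept the write only if `token` is still the current lease for `key`."""
        with self._mu:
            current, _lease_expires = self._leases.get(key, (None, 0.0))
            if token is None or token != current:
                return False       # late write from an expired lease: reject
            self._data[key] = (value, time.monotonic() + value_ttl)
            del self._leases[key]  # release the lease
            return True

    def get(self, key):
        with self._mu:
            value, expires = self._data.get(key, (None, 0.0))
            return value if expires > time.monotonic() else None
```

A requester that fails `try_acquire_lease` serves the stale value if one exists, or retries with jittered backoff as described in the takeaways below.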
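For the origin-side refresh cap mentioned under hot key defenses, a token bucket is a common choice. A minimal sketch, assuming a single-threaded caller; the class name and defaults are illustrative:

```python
import time


class RefreshTokenBucket:
    """Global rate limiter for origin refresh traffic.

    Even if 10,000 keys expire at once, refreshes drain through this bucket
    at `rate` per second with at most `burst` immediate refreshes.
    """

    def __init__(self, rate=1000.0, burst=1000.0):
        self.rate = rate           # refills per second (refresh QPS cap)
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True            # refresh may hit the origin
        return False               # refresh denied: serve stale or default content
```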
💡 Key Takeaways
Lock leakage: crashed lease holder with 5 second lock TTL blocks all refresh for 5 seconds; too short lock TTL (150ms for 200ms P99 refresh) causes duplicate work and write conflicts
Fencing tokens solve late writer problem: monotonic version numbers ensure cache rejects writes from expired leases, preventing lost updates when multiple refreshes overlap
Cold cache on restart: 1,000 empty instances generate millions of origin requests on first traffic wave; requires cache warming, progressive traffic ramp over 5 to 10 minutes, and request shedding with QoS
Hot key attacks or viral content: key spikes from 100 to 100,000 RPS in seconds; defense requires per key concurrency cap (max 1 refresh), origin rate limiting (e.g., 1,000 QPS cap), and serving stale or default under pressure
Negative cache pathology: caching 404 not found with 60 second TTL prevents stampedes on missing keys but can persist 404s after late writes; mitigation uses short negative TTL (5 to 10s) and invalidate on write
Thundering retry storm: when lock acquisition fails, followers must use exponential backoff with jitter (10ms, 20ms, 40ms, 80ms delays) to prevent synchronized retry waves that compound the problem (see the sketch after this list)
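A minimal sketch of the jittered backoff in the last takeaway, using "full jitter" (each wait is a random amount up to the current exponential bound, capped at 80ms to match the delays above); the function name and defaults are illustrative:

```python
import random
import time


def acquire_with_backoff(try_acquire, base_delay=0.01, max_delay=0.08, max_attempts=5):
    """Retry lock acquisition with exponential backoff and full jitter.

    `try_acquire` is any callable returning a truthy token on success,
    e.g. `lambda: cache.try_acquire_lease(key)` from the sketch above.
    """
    delay = base_delay
    for _attempt in range(max_attempts):
        token = try_acquire()
        if token:
            return token
        # Full jitter: sleep a random fraction of the current bound so that
        # followers do not wake up (and retry) in a synchronized wave.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, max_delay)
    return None  # give up; caller should serve stale or a default fallback
```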
📌 Examples
Production incident: Lease holder crashes during 500ms database query. Lock TTL set to 2 seconds. For the next 2 seconds, 40,000 requests (20k RPS × 2s) queue waiting for the lock, causing user timeouts. Post-incident, the team reduces the lock TTL to 800ms and adds fencing tokens to reject late writes.
Reddit during breaking news: Popular post jumps from 500 RPS to 80,000 RPS in 30 seconds. Per key concurrency cap limits refresh to 1 in flight; remaining 79,999 RPS served from stale cache. Origin load stays at 1 QPS for that key instead of spiking to 80,000 QPS. Hot key detector triggers, extends TTL from 2 minutes to 10 minutes.
E-commerce deployment: Rolling restart of 500 cache instances over 5 minutes. Without warming, each instance misses all keys on startup, generating 500 × 10,000 keys × 20 RPS = 100 million origin requests. With warming (preload top 1,000 hot keys) and 10 minute traffic ramp, origin load stays below 5,000 QPS throughout restart.
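To make the warming step in the deployment example concrete, here is a minimal sketch that preloads a hot-key list through the lease interface sketched earlier, so concurrent warming workers and early user traffic cannot stampede the origin. The helper name and its arguments are assumptions, not a specific framework's API:

```python
def warm_cache(cache, origin_fetch, hot_keys, value_ttl=120.0):
    """Preload the hottest keys before an instance starts taking traffic.

    `cache` is anything exposing the try_acquire_lease / put_if_current
    interface sketched above; `origin_fetch(key)` loads one value from the
    origin; `hot_keys` would normally come from recent hit statistics or a
    backup cache tier.
    """
    warmed = 0
    for key in hot_keys:
        token = cache.try_acquire_lease(key)
        if token is None:
            continue  # another warming worker (or an early request) owns this key
        if cache.put_if_current(key, origin_fetch(key), token, value_ttl):
            warmed += 1
    return warmed
```

Only after warming completes does the instance register with the load balancer, which then ramps its traffic share gradually over the 5 to 10 minute window described above.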