Cache Pattern Failure Modes: What Breaks in Production
Thundering Herd: Synchronized Stampede
Thundering herd (also called cache stampede) occurs when many clients simultaneously experience a cache miss on the same key and all attempt to fetch the backing data at once. This happens when a TTL expires, during mass invalidation events, or after a cache cold start. Consider a hot key serving 10,000 requests per second. When its TTL expires, every application server simultaneously detects the miss and issues a database query. The database, designed to handle perhaps 100 queries per second with the cache absorbing the rest, suddenly receives 10,000 concurrent requests. Response times degrade from milliseconds to seconds. Connection pools exhaust. Cascading timeouts propagate upstream.
Mitigating Thundering Herds
Mitigation requires breaking the synchronization. Lease tokens ensure only one requester fetches while the others wait for the result: on a miss, acquire a per-key lease with a short TTL (5-10 seconds); if the lease is already held, wait briefly and then retry the cache get. Probabilistic early refresh triggers background updates before TTL expiry, with randomized timing. TTL jitter (randomizing expiry by 10-20%) prevents fleet-wide synchronization, where every server expires the same key at the same moment.
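The lease and jitter mechanics can be sketched with an in-memory cache. This is a minimal illustration, not a production implementation: the class and method names (`LeaseCache`, `get_or_fetch`) are invented for this sketch, and a real deployment would hold the lease in a shared store (e.g. a Redis key set with an expiry) rather than a local lock-protected dict.

```python
import random
import threading
import time

class LeaseCache:
    """Illustrative in-memory cache applying the lease + TTL-jitter pattern."""

    def __init__(self, lease_ttl=5.0, base_ttl=60.0, jitter=0.2):
        self._data = {}          # key -> (value, expires_at)
        self._leases = {}        # key -> lease expiry time
        self._lock = threading.Lock()
        self.lease_ttl = lease_ttl
        self.base_ttl = base_ttl
        self.jitter = jitter

    def _jittered_ttl(self):
        # Randomize the TTL by +/- jitter fraction so a fleet of servers
        # does not expire the same key at the same moment.
        return self.base_ttl * (1 + random.uniform(-self.jitter, self.jitter))

    def get_or_fetch(self, key, fetch, retry_delay=0.05, max_wait=5.0):
        deadline = time.monotonic() + max_wait
        while True:
            now = time.monotonic()
            with self._lock:
                entry = self._data.get(key)
                if entry and entry[1] > now:
                    return entry[0]                      # cache hit
                lease = self._leases.get(key)
                if lease is None or lease <= now:
                    self._leases[key] = now + self.lease_ttl  # acquire lease
                    acquired = True
                else:
                    acquired = False
            if acquired:
                value = fetch()                          # only this caller fetches
                with self._lock:
                    self._data[key] = (value, time.monotonic() + self._jittered_ttl())
                    self._leases.pop(key, None)
                return value
            if time.monotonic() + retry_delay > deadline:
                return fetch()                           # waited too long: fetch directly
            time.sleep(retry_delay)                      # wait briefly, then retry the get
```

Under concurrent misses on one key, only the lease holder calls `fetch`; the other callers sleep and then find the value in the cache.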
Stale Read Race Condition
The naive update-cache-on-write pattern creates a subtle race condition. Thread A reads stale data from the database during a slow query taking 100 ms. While Thread A waits, Thread B writes new data to both the database and the cache. Thread A completes and overwrites the cache with its stale result; the cache now serves incorrect data until TTL expiry. This is why production systems use delete-on-write: write to the database, then delete the cache key, forcing the next reader to fetch fresh data. Combined with version stamps (each cached value carries a version number), the cache can reject writes of older versions.
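The version-stamp defense can be shown with a small sketch. The names (`VersionedCache`, `set_if_newer`) are illustrative; a real store would implement the compare in an atomic server-side operation (e.g. a Lua script in Redis) rather than a plain Python dict.

```python
class VersionedCache:
    """Illustrative cache that rejects writes carrying an older version."""

    def __init__(self):
        self._store = {}   # key -> (version, value)

    def get(self, key):
        entry = self._store.get(key)
        return entry[1] if entry else None

    def set_if_newer(self, key, version, value):
        # A slow reader replaying an old read cannot overwrite fresher data:
        # the write is accepted only if its version is newer than the cached one.
        current = self._store.get(key)
        if current is None or version > current[0]:
            self._store[key] = (version, value)
            return True
        return False

    def delete(self, key):
        # Delete-on-write: the writer removes the key after the database write,
        # forcing the next reader to fetch fresh data.
        self._store.pop(key, None)

# Simulated race: Thread B's write of version 2 lands while Thread A's
# slow read of version 1 is still in flight.
cache = VersionedCache()
cache.set_if_newer("user:42", 2, "fresh")   # Thread B finishes first
cache.set_if_newer("user:42", 1, "stale")   # Thread A's late write is rejected
```

After the race, the cache still holds the fresh value, which is exactly the failure the naive pattern cannot prevent.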
Write-Back Durability Failure
In write-back systems, the cache node holds uncommitted writes in a buffer. If the node crashes before flushing to the database, those writes vanish. A 5-second flush interval means up to 5 seconds of data loss per node failure; for a counter receiving 1,000 increments per second, that is 5,000 lost increments. Mitigations: replicate the write buffer across nodes before acknowledging, persist to a local write-ahead log (WAL) that survives crashes, and use idempotent writes so replay is safe.
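The WAL-plus-idempotent-replay combination can be sketched as follows. This is a simplified model under stated assumptions: `WriteBackBuffer` is an invented name, the "database" is a dict, and each write carries a unique id so that replaying the log twice is harmless.

```python
import json
import os

class WriteBackBuffer:
    """Illustrative write-back buffer: every write is appended to a local
    write-ahead log before being acknowledged, so a crash loses nothing
    that was acked; replay is safe because applies are keyed by write id."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.buffer = {}       # write_id -> (key, value)

    def write(self, write_id, key, value):
        # Durability first: append to the WAL and fsync before acknowledging.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps({"id": write_id, "key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.buffer[write_id] = (key, value)

    def flush(self, database, applied_ids):
        # Idempotent apply: skip write ids the database has already seen,
        # so replaying the same WAL after a crash does not double-apply.
        for write_id, (key, value) in list(self.buffer.items()):
            if write_id not in applied_ids:
                database[key] = value
                applied_ids.add(write_id)
            del self.buffer[write_id]

    def recover(self):
        # After a crash, rebuild the in-memory buffer from the WAL.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                rec = json.loads(line)
                self.buffer[rec["id"]] = (rec["key"], rec["value"])
```

A restarted node calls `recover()` and then `flush()`; because applies are keyed by write id, a WAL replayed more than once leaves the database unchanged.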
Cold Cache Cascade
After a deployment, restart, or cache failure, the cache is empty. Every request misses and hits the database. If the database cannot handle full load without the cache (often true, since the cache absorbs 80-95% of reads), it overloads and the system fails to recover. Solutions: cache warming (preloading critical hot keys before serving traffic), gradual traffic ramp-up over 5-10 minutes, and request shedding (dropping low-priority requests under extreme load while preserving high-priority ones).
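Gradual ramp-up combined with priority-aware shedding can be sketched as a single admission check. This is illustrative only: the function name `admit` and the deterministic per-request bucketing are assumptions for the sketch, and real systems typically enforce this at the load balancer or admission-control layer.

```python
def admit(priority, request_id, start_time, now, ramp_seconds=300.0):
    """Illustrative admission check during cold-cache recovery.

    Over ramp_seconds (5 minutes here), the admitted fraction of
    low-priority traffic grows linearly from 0 to 1. High-priority
    requests are never shed. Bucketing by request_id makes admission
    deterministic, so retries of the same request behave consistently.
    """
    fraction = min(1.0, max(0.0, (now - start_time) / ramp_seconds))
    if priority == "high":
        return True                     # always preserve high-priority requests
    # Low-priority buckets fill in as the ramp progresses: bucket 0 is
    # admitted almost immediately, bucket 99 only once the ramp completes.
    return (request_id % 100) < fraction * 100
```

At the start of the ramp nearly all low-priority traffic is shed, keeping the database load at a level it can serve while the cache fills; by the end of the window all traffic is admitted.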