
Failure Modes and Edge Cases in Cache Eviction

Even well-designed eviction policies exhibit catastrophic failure modes under specific workload patterns or configuration mistakes. Understanding these edge cases is critical for operational reliability and capacity planning.

Scan pollution is a classic failure: a sequential scan or analytics job that touches a huge keyset once can evict the entire working set under pure LRU, collapsing the hit rate from 80% to near zero for the duration of the scan and causing a backend load spike. Mitigations include admission filters (TinyLFU rejects low-frequency scan keys), segmented recency designs (scans flow through a probation segment without disturbing the protected segment), or tagging scan traffic and routing it to a separate cache or bypass tier.

Another common failure is the thundering herd on expiration. When many keys expire simultaneously (for example, synchronized Time To Live (TTL) values set at application startup or batch load), the cache experiences a coordinated miss storm: all clients detect misses concurrently and issue backend queries, overwhelming the database. Mitigations include adding jitter to TTLs (for example, TTL = base + random(0, 0.1 × base)), implementing soft TTLs with background refresh (serve stale data while asynchronously fetching fresh), and request coalescing (deduplicating concurrent identical requests so only one backend query is issued). Similarly, evicting a very popular item under memory pressure can trigger a stampede if many clients request it simultaneously; probabilistic early expiration and refreshing ahead of eviction can prevent this.

Slab fragmentation in per-size-class eviction systems like Memcached creates another failure mode. With fixed-size allocation classes (for example, 64-byte, 128-byte, and 256-byte slabs), hot growth in one size class may trigger aggressive evictions while free memory is stranded in other, underutilized classes. For example, a sudden increase in 128-byte objects causes the 128-byte slab class to evict frequently at 95% utilization while the 256-byte class sits at 30% utilization. Rebalancing daemons help by moving pages between slab classes, but they lag behind traffic shifts and can cause sustained elevated evictions and tail-latency spikes. Monitoring per-slab utilization and eviction rates is essential.

In Count-Min Sketch (CMS) based admission (TinyLFU), hash collisions inflate frequency estimates: with insufficient width or depth, popular and unpopular keys can hash to the same counters, causing false admissions or rejections. A CMS with width 10,000 and depth 4 limits error to roughly 1% to 2% relative frequency inflation for millions of keys, but undersized sketches degrade severely. Additionally, the sketch must be periodically reset to bound stale history; without resets, old frequencies dominate and the policy stops adapting to workload changes.
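The sketch-based counting just described can be illustrated with a minimal Count-Min Sketch that includes periodic counter halving. This is an illustrative sketch of the idea, not TinyLFU's actual implementation; the width and depth defaults echo the figures above, and the reset threshold is a hypothetical choice.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=10_000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]
        self.total = 0

    def _index(self, key, row):
        # Deterministic per-row hash; each row uses a different prefix.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def increment(self, key):
        self.total += 1
        for row in range(self.depth):
            self.tables[row][self._index(key, row)] += 1
        # Periodic reset: halve every counter so stale history decays
        # and the policy keeps adapting to workload shifts.
        if self.total >= 10 * self.width:
            self.total //= 2
            for table in self.tables:
                for i in range(self.width):
                    table[i] //= 2

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on true frequency.
        return min(self.tables[row][self._index(key, row)]
                   for row in range(self.depth))
```

An admission filter then compares estimates, for example admitting a candidate only when `cms.estimate(candidate) > cms.estimate(victim)`; undersizing `width` makes those estimates collide and inflate, which is exactly the failure mode described above.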
💡 Key Takeaways
Scan pollution under pure LRU: a sequential scan touching a huge keyset once evicts the entire working set, collapsing hit rate from 80% to near zero. Mitigations include admission filters (TinyLFU), segmented recency (2Q/SLRU), or routing scans to a separate tier.
Thundering herd on synchronized TTL expiration: all keys expire simultaneously, triggering a coordinated miss storm. Adding jitter (TTL = base + random(0, 0.1 × base)) spreads expiration over time; soft TTLs with background refresh serve stale data while fetching fresh.
Slab fragmentation in per-size-class eviction: hot growth in one slab class triggers aggressive eviction while memory is stranded in other classes. Example: a 128-byte slab class at 95% utilization evicting while a 256-byte class sits at 30%. Rebalancing daemons help but lag behind traffic shifts.
Count-Min Sketch hash collisions inflate frequency estimates, causing false admissions. A sketch with width 10,000 and depth 4 limits error to roughly 1% to 2%; undersized sketches degrade hit rate. Periodic sketch resets bound stale history and ensure adaptation.
Stale heavies in LFU without decay: once-popular items retain high counts indefinitely and resist eviction even after access patterns change. Periodic counter halving or windowed sketches (TinyLFU) expire old frequencies.
Multi-tenant interference: one hot tenant can monopolize the cache and evict other tenants' working sets. Mitigations include per-tenant partitions, dynamic reservations, or per-tenant admission filters (a separate TinyLFU per tenant).
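The jitter formula in the takeaways above can be sketched in a few lines; the 10% spread is the example value from the text, not a universal constant.

```python
import random

def jittered_ttl(base_ttl: int, spread: float = 0.1) -> int:
    # TTL = base + random(0, spread * base): keys written together no
    # longer expire in the same instant but across a small window.
    return base_ttl + int(random.uniform(0, spread * base_ttl))
```

For a one-hour base TTL, `jittered_ttl(3600)` returns a value between 3600 and 3960 seconds, spreading a synchronized expiration spike over a six-minute window.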
📌 Examples
A production Memcached cluster experienced a 50% hit rate drop during nightly batch analytics. The batch job scanned millions of keys once, evicting the hot working set under LRU. Solution: route batch traffic to a separate cache tier with short TTL, or bypass caching for identified scan endpoints.
An e-commerce site cached product data with synchronized 1 hour TTL. Every hour at :00, 100,000 keys expired simultaneously, causing 100,000 backend queries in under 1 second and database timeouts. Adding random(0, 300) second jitter spread expiration over 5 minutes, reducing peak load from 100,000 queries per second to 333 queries per second.
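Request coalescing, the complementary mitigation mentioned earlier, can be sketched as a minimal single-flight wrapper. The class and method names here are hypothetical, not a specific library's API; it assumes a threaded client.

```python
import threading

class SingleFlight:
    """Deduplicate concurrent identical requests: the first caller for
    a key performs the backend fetch; concurrent callers for the same
    key wait for it and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # No fetch in flight for this key: become the leader.
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, holder = entry
        if leader:
            try:
                holder["value"] = fetch(key)
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        return holder["value"]
```

Under a miss storm on one hot key, N concurrent callers then produce a single backend query instead of N, which bounds the stampede regardless of how the key came to be missing.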