Failure Modes and Edge Cases in Cache Eviction
Scan Pollution
A sequential scan or analytics job that touches a huge keyset exactly once can evict the entire working set under pure LRU, collapsing the hit rate from 80% to near zero. The scan promotes each key to the MRU (most recently used) position, pushing hot items out. Mitigations: admission filters (TinyLFU rejects low-frequency scan keys), segmented designs (scan traffic flows through a probation segment without disturbing the protected segment), or routing scan traffic to a separate tier.
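The segmented idea can be sketched as a two-segment LRU. This is an illustrative minimal design, not a specific library's implementation: new keys enter probation and are promoted to protected only on a second hit, so a one-pass scan can evict at most the probation segment.

```python
from collections import OrderedDict

class SegmentedLRU:
    """Two-segment LRU sketch: keys enter probation and are promoted
    to protected on a second hit. A one-pass scan churns only the
    probation segment; the protected working set survives."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)      # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)      # second hit: promote
            self.protected[key] = value
            if len(self.protected) > self.protected_size:
                # demote the coldest protected entry back to probation
                old_key, old_val = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_val)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        self.probation.move_to_end(key)
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)   # evict coldest probation key

cache = SegmentedLRU(probation_size=2, protected_size=2)
cache.put("hot", 1)
cache.get("hot")                  # second touch promotes to protected
for k in range(1000):             # simulate a one-pass scan
    cache.put(f"scan-{k}", k)
assert cache.get("hot") == 1      # hot key survives the scan
```

Under pure LRU the same scan would have evicted "hot" after two insertions; here the scan keys never leave probation because each is touched only once.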
Thundering Herd on Expiration
When many keys expire simultaneously (synchronized TTL values from a batch load or a cold start), all clients detect misses concurrently and issue backend queries, overwhelming the database. Adding jitter spreads expirations: TTL = base + random(0, 0.1 * base). With 100,000 keys sharing a 1-hour TTL, synchronized expiration causes 100,000 backend queries in under 1 second; spreading expirations across 5 minutes of jitter reduces that to roughly 333 queries per second.
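The jitter formula above is a one-liner; this sketch (function name is my own) also shows the load arithmetic from the example:

```python
import random

def jittered_ttl(base_ttl_s, jitter_frac=0.1):
    """TTL with uniform jitter: base + U(0, jitter_frac * base),
    so expirations spread over a window instead of one instant."""
    return base_ttl_s + random.uniform(0, jitter_frac * base_ttl_s)

ttl = jittered_ttl(3600)          # 1-hour base, expires in [3600, 3960) s
assert 3600 <= ttl < 3960

# Load arithmetic from the example: 100,000 keys spread over a
# 5-minute (300 s) jitter window instead of a single instant.
keys, window_s = 100_000, 300
print(keys / window_s)            # → 333.33... queries per second
```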
Slab Fragmentation
Per-size-class allocators use fixed-size slabs (64 byte, 128 byte, 256 byte). Growth in one hot class triggers aggressive eviction there while free memory sits stranded in other classes: the 128-byte class runs at 95% utilization and evicts constantly while the 256-byte class sits at 30%. Rebalancing daemons help but lag behind traffic shifts, causing elevated evictions and tail-latency spikes. Monitor per-slab utilization and eviction rates.
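A monitoring check for this condition might look like the following sketch. The function name, input shape, and the 90%/50% thresholds are illustrative assumptions, not values from any particular cache:

```python
def slab_imbalance(slabs):
    """Flag a slab imbalance: one size class nearly full (likely
    evicting) while another strands free memory.
    `slabs` maps size class (bytes) -> (used_chunks, total_chunks).
    Thresholds (90% hot, 50% cold) are illustrative."""
    util = {size: used / total for size, (used, total) in slabs.items()}
    hottest = max(util, key=util.get)
    coldest = min(util, key=util.get)
    # rebalance hint: only worth moving pages when the gap is large
    needs_rebalance = util[hottest] > 0.90 and util[coldest] < 0.50
    return hottest, coldest, needs_rebalance

# The scenario from the text: 128-byte class at 95%, 256-byte at 30%.
stats = {64: (40, 100), 128: (95, 100), 256: (30, 100)}
print(slab_imbalance(stats))   # → (128, 256, True)
```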
Sketch Sizing and Stale Frequencies
In Count-Min Sketch based admission, an undersized sketch causes hash collisions that inflate frequency estimates, leading to false admissions; a width of 10,000 with depth 4 keeps the estimation error under 2%. Without periodic resets, old frequencies dominate and the policy stops adapting. Similarly, in LFU without decay, once-popular items retain high counts indefinitely (stale heavies) and resist eviction after access patterns change.
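A minimal Count-Min Sketch with the periodic-reset mechanism described above might look like this. The hashing scheme and the halving-style reset (counters are halved after a fixed number of increments so stale heavy hitters decay) are one common design, sketched here as an assumption rather than any specific library's code:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch with aging: after `reset_every`
    increments, all counters are halved so stale heavy hitters
    decay and the admission policy keeps adapting."""

    def __init__(self, width=10_000, depth=4, reset_every=100_000):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]
        self.ops = 0
        self.reset_every = reset_every

    def _indexes(self, key):
        # one independent-ish hash per row, derived by salting the key
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key):
        for row, idx in self._indexes(key):
            self.rows[row][idx] += 1
        self.ops += 1
        if self.ops >= self.reset_every:   # aging: halve every counter
            self.rows = [[c // 2 for c in row] for row in self.rows]
            self.ops = 0

    def estimate(self, key):
        # min over rows: collisions can only inflate, never deflate
        return min(self.rows[row][idx] for row, idx in self._indexes(key))

cms = CountMinSketch()
for _ in range(5):
    cms.add("hot-key")
assert cms.estimate("hot-key") >= 5   # estimates only overshoot
```

The min-over-rows estimate means collisions inflate counts but never deflate them, which is exactly why an undersized width produces false admissions: colliding scan keys inherit the frequency of whatever hot keys share their counters.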