Cache Eviction Policies, Admission Control, and Capacity Planning
Cache eviction determines which keys are removed when memory fills up, directly impacting hit ratio and backend load. Classic Least Recently Used (LRU) eviction removes the least recently accessed key, favoring temporal locality and working well for many workloads. However, LRU is vulnerable to cache pollution from scan traffic and one-hit wonders: a burst of unique keys that are accessed once and never again can evict valuable hot keys, dropping hit rate by 10 to 20 percent in production. Modern admission policies like TinyLFU (Tiny Least Frequently Used) track key access frequency in a compact counting sketch and admit a new key only if it is accessed more frequently than the eviction victim, preventing cold keys from polluting the cache. Caffeine, a popular Java caching library used at scale, combines TinyLFU admission with a segmented LRU to achieve near-optimal hit rates with minimal memory overhead. For cache clusters serving hundreds of gigabytes, even a 5 percent improvement in hit rate can reduce backend load by tens of thousands of queries per second.
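To make this concrete, here is a minimal sketch of a size-bounded Caffeine cache; the capacity, key names, and value sizes are illustrative. Bounding the cache by size is what activates Caffeine's Window-TinyLFU policy:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class TinyLfuCacheExample {
    public static void main(String[] args) {
        // Bounding the cache by size enables Caffeine's W-TinyLFU
        // admission policy: a frequency sketch decides whether a new
        // key is more valuable than the eviction victim.
        Cache<String, byte[]> cache = Caffeine.newBuilder()
                .maximumSize(1_000_000)  // entry cap; tune to working set
                .recordStats()           // hit/miss/eviction counters
                .build();

        cache.put("user:42", new byte[1024]);
        byte[] value = cache.getIfPresent("user:42"); // null on a miss

        // These stats feed the hit-ratio and eviction-rate dashboards
        // discussed later in this section.
        System.out.printf("hit rate: %.2f, evictions: %d%n",
                cache.stats().hitRate(), cache.stats().evictionCount());
    }
}
```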
Negative caching and size-aware eviction are also important production techniques. Negative caching stores the fact that a key does not exist (caching the miss) to prevent repeated backend queries for non-existent data, common during attacks or bugs. Negative entries must carry very short TTLs (under 10 seconds) to avoid masking newly created data. Size-aware eviction accounts for object size when choosing victims: evicting one 10 MB object frees roughly as much memory as evicting 10,000 small 1 KB objects, but may hurt hit rate if the large object is frequently accessed. Policies like GreedyDual-Size-Frequency (GDSF) balance size, frequency, and recency. Production systems also enforce maximum object size limits (often 1 MB to 10 MB) to prevent a few large objects from dominating cache memory and evicting many hot small keys.
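A sketch of size-aware eviction using Caffeine's weigher API, combined with a maximum-object-size check at admission; the memory budget, size cap, and the put helper are illustrative assumptions, not fixed recommendations:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class SizeAwareCacheExample {
    private static final int MAX_OBJECT_BYTES = 1 << 20; // 1 MB admission cap

    public static void main(String[] args) {
        // Weight entries by payload size so eviction accounts for the
        // memory each object actually occupies, not just entry count.
        Cache<String, byte[]> cache = Caffeine.newBuilder()
                .maximumWeight(512L * 1024 * 1024)            // ~512 MB budget
                .weigher((String key, byte[] value) -> value.length)
                .build();

        put(cache, "thumb:1", new byte[1024]);       // admitted: 1 KB
        put(cache, "manifest:1", new byte[5 << 20]); // rejected: 5 MB > cap
    }

    // Enforce the maximum object size limit described above before insertion.
    static void put(Cache<String, byte[]> cache, String key, byte[] value) {
        if (value.length <= MAX_OBJECT_BYTES) {
            cache.put(key, value);
        }
    }
}
```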
Capacity planning requires estimating the working set size from production traffic traces. A common approach is to measure the 95th or 99th percentile count of unique keys accessed over sliding time windows (1 hour, 4 hours, 24 hours) and multiply by average object size plus overhead (typically 20 to 30 percent for metadata and fragmentation). For example, a service at 200,000 requests per second with 1 KB average value size and a target 99 percent hit rate might need 400 to 600 GB of effective cache after replication and overhead. Maintaining 20 to 30 percent free memory headroom limits eviction churn and keeps hit rates stable under traffic spikes. Observability must track eviction rate (evictions per second), item churn (new insertions per second), memory fragmentation, and per-key size distribution, and correlate these with hit ratio and backend load. Automated alerts should fire when eviction rate exceeds thresholds or when backend query rate spikes, indicating that cache capacity is insufficient or that a scan workload is polluting the cache.
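The arithmetic can be packaged as a small planning helper. This sketch assumes the inputs (p99 unique-key count, overhead factor, replica count, headroom) come from your own trace measurements; all constants in main are illustrative:

```java
public class CapacityPlanner {
    // Estimate required cache memory from trace measurements:
    // keys x avg size, inflated for overhead, replicated, then
    // divided so that the target free headroom remains.
    static long estimateBytes(long uniqueKeysP99, long avgObjectBytes,
                              double overhead, int replicas, double headroom) {
        double workingSet = uniqueKeysP99 * (double) avgObjectBytes;
        double withOverhead = workingSet * (1.0 + overhead); // metadata + fragmentation
        double replicated = withOverhead * replicas;
        return (long) (replicated / (1.0 - headroom));       // keep free headroom
    }

    public static void main(String[] args) {
        // e.g. 150M unique keys at p99, 1 KB values, 25% overhead,
        // 2 replicas, 25% free headroom -> ~512 GB, consistent with
        // the 400 to 600 GB range cited above.
        long bytes = estimateBytes(150_000_000L, 1_024, 0.25, 2, 0.25);
        System.out.printf("provision ~%.0f GB%n", bytes / 1e9);
    }
}
```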
💡 Key Takeaways
• Least Recently Used (LRU) eviction works well for temporal locality but is vulnerable to scan pollution: a burst of one-hit-wonder keys can evict valuable hot keys and drop hit rate by 10 to 20 percent in production systems.
• The TinyLFU admission policy tracks access frequency with compact counting sketches and admits new keys only if they are more popular than eviction candidates, preventing cache pollution. The Caffeine library achieves near-optimal hit rates using TinyLFU with segmented LRU.
• Negative caching stores non-existent key results with very short TTLs (under 10 seconds) to prevent repeated backend queries during attacks or bugs. Longer TTLs risk masking newly created data and causing application errors.
• Size-aware eviction balances object size, access frequency, and recency. Maximum object size limits (1 MB to 10 MB) prevent large objects from dominating memory and evicting many small hot keys. GreedyDual-Size-Frequency (GDSF) is a common production policy.
• Capacity planning estimates working set size from traffic traces: the 95th or 99th percentile count of unique keys over time windows, times average object size, plus 20 to 30 percent overhead. A service at 200,000 requests per second with 1 KB values and a 99 percent hit target needs 400 to 600 GB of effective cache after replication.
• Maintain 20 to 30 percent free memory headroom to limit eviction churn and keep hit rates stable under traffic bursts. Track eviction rate, item churn, fragmentation, and per-key size distribution, correlating with hit ratio and backend load for capacity alerts.
📌 Examples
A recommendation service uses TinyLFU admission: during a batch job that scans 10 million historical user profiles once, the cache rejects 95 percent of these cold keys at admission, preserving space for the 50,000 hot user keys that serve real-time traffic at 100,000 requests per second.
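The mechanism behind that rejection can be illustrated with a toy count-min-sketch admission filter. The hash seeds and sketch dimensions are arbitrary, and real TinyLFU also ages its counters periodically; this is a simplified sketch of the admit-if-hotter-than-victim idea only:

```java
public class TinyLfuAdmission {
    private final int[][] counters = new int[4][1 << 16]; // 4 rows x 65536 slots
    private final int[] seeds = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

    void recordAccess(String key) {
        for (int row = 0; row < 4; row++) {
            counters[row][index(key, row)]++;
        }
    }

    // Count-min estimate: the minimum across rows bounds overcounting.
    int estimate(String key) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < 4; row++) {
            min = Math.min(min, counters[row][index(key, row)]);
        }
        return min;
    }

    // Admit the candidate only if it appears hotter than the key
    // that eviction would remove.
    boolean admit(String candidate, String victim) {
        return estimate(candidate) > estimate(victim);
    }

    private int index(String key, int row) {
        return (key.hashCode() * seeds[row]) >>> 16; // fold hash to 16 bits
    }

    public static void main(String[] args) {
        TinyLfuAdmission filter = new TinyLfuAdmission();
        for (int i = 0; i < 100; i++) filter.recordAccess("user:hot");
        filter.recordAccess("profile:cold");  // one-hit-wonder scan key
        // The scan key loses to the hot victim, so it is never cached.
        System.out.println(filter.admit("profile:cold", "user:hot")); // false
    }
}
```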
Negative caching for user lookup: an API receives 10,000 requests per second for non-existent user IDs during a credential stuffing attack. Caching each not-found result with a 5 second TTL means one backend query per unique ID every 5 seconds; with roughly 10,000 distinct IDs in rotation, database queries drop from 10,000 per second to about 2,000 per second.
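A sketch of that pattern using Caffeine's per-entry Expiry API, storing an Optional.empty() sentinel for misses; the TTL values and the queryDatabase helper are illustrative assumptions:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

public class NegativeCacheExample {
    // Sentinel meaning "we checked the backend; this user does not exist".
    private static final Optional<String> NOT_FOUND = Optional.empty();

    private final Cache<String, Optional<String>> cache = Caffeine.newBuilder()
            .expireAfter(new Expiry<String, Optional<String>>() {
                @Override
                public long expireAfterCreate(String key, Optional<String> value, long now) {
                    // Misses live 5 seconds so new users appear quickly;
                    // real entries live 10 minutes (both illustrative).
                    return value.isEmpty() ? TimeUnit.SECONDS.toNanos(5)
                                           : TimeUnit.MINUTES.toNanos(10);
                }
                @Override
                public long expireAfterUpdate(String key, Optional<String> value,
                                              long now, long current) {
                    return expireAfterCreate(key, value, now);
                }
                @Override
                public long expireAfterRead(String key, Optional<String> value,
                                            long now, long current) {
                    return current; // reads do not extend lifetime
                }
            })
            .maximumSize(100_000)
            .build();

    String lookup(String userId) {
        // Misses are computed once per TTL window, then served from cache.
        return cache.get(userId, this::queryDatabase).orElse(null);
    }

    // Hypothetical backend query; returns the sentinel for unknown IDs.
    private Optional<String> queryDatabase(String userId) {
        return NOT_FOUND;
    }
}
```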
Size-aware eviction: a media metadata cache stores 1 KB thumbnail URLs and 5 MB video manifests. Under pure LRU, admitting a single manifest can evict roughly 5,000 thumbnail entries. A size-aware policy considers that thumbnails are accessed 100 times more frequently and evicts manifests preferentially, maintaining hit rate on the hot thumbnail workload.
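The GDSF priority that drives this decision can be shown with a toy calculation; the uniform miss cost and the access counts below are assumptions chosen to match the example's numbers:

```java
public class GdsfPriority {
    // GreedyDual-Size-Frequency: H(p) = L + freq(p) * cost(p) / size(p).
    // High frequency and small size raise priority; the global clock L
    // is advanced to the evicted entry's priority on each eviction, so
    // long-idle entries eventually become evictable (recency).
    static double priority(double clock, long accessCount,
                           double missCost, long sizeBytes) {
        return clock + accessCount * missCost / sizeBytes;
    }

    public static void main(String[] args) {
        double clock = 0.0;
        // 1 KB thumbnail accessed 100x as often as a 5 MB manifest:
        double thumb = priority(clock, 100, 1.0, 1_024);
        double manifest = priority(clock, 1, 1.0, 5 * 1_024 * 1_024);
        // thumb ~0.0977 vs manifest ~0.00000019: the manifest is the
        // preferred eviction victim despite being recently scanned.
        System.out.printf("thumb=%.6f manifest=%.9f%n", thumb, manifest);
    }
}
```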