ML Model Optimization • Model Caching (Embedding Cache, Result Cache)
Failure Modes: Cache Stampede, Embedding Drift, and False Positives
Production caching systems fail in predictable ways. Understanding these failure modes and their mitigations separates reliable systems from those that amplify problems under load.
Cache stampede occurs when a viral prompt or a cold cache causes many concurrent requests to miss simultaneously and hit the backend. Without request coalescing, a spike from 100 to 10,000 requests per second can overwhelm model capacity. The solution is single-flight deduplication: when the first request misses, subsequent identical requests wait for its result rather than all invoking the model. This requires a lightweight lock or promise per cache key. Alternatively, serve stale entries with a short grace period while one request refreshes the value in the background.
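A minimal asyncio sketch of single-flight deduplication; the `compute` callback standing in for the model call is an assumption:

```python
import asyncio

class SingleFlightCache:
    """In-process cache with single-flight deduplication: concurrent
    misses on the same key share one backend call instead of stampeding."""

    def __init__(self, compute):
        self._compute = compute                  # async fn: key -> value (assumed)
        self._values = {}                        # completed results
        self._inflight = {}                      # key -> shared Future

    async def get(self, key):
        if key in self._values:                  # fast path: cache hit
            return self._values[key]
        if key in self._inflight:                # another request is computing;
            return await self._inflight[key]     # wait for its result
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut                # claim the key before awaiting
        try:
            value = await self._compute(key)
            self._values[key] = value
            fut.set_result(value)                # wake every waiting request
            return value
        except Exception as exc:
            fut.set_exception(exc)               # propagate failure to waiters
            raise
        finally:
            del self._inflight[key]
```

Under a stampede, 10,000 concurrent `get()` calls for the same key produce exactly one `compute` invocation. The serve-stale variant returns the expired value immediately and schedules the refresh as a background task instead of awaiting it.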
Embedding drift happens when the embedding model or preprocessing changes but the cache is not versioned. Vectors from the old 768-dimensional model do not align with queries from the new 1536-dimensional model. Semantic cache hit rates collapse, and false positive rates spike because the mismatched geometry produces meaningless similarity scores that are sometimes spuriously high. The mitigation is mandatory namespacing by model version and preprocessing version. After an upgrade, old entries become inaccessible and the cache warms naturally. Some teams run dual caches during migration, gradually shifting traffic as the new cache fills.
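A sketch of the key scheme; the version identifiers and the `emb:` prefix are illustrative:

```python
import hashlib

# Illustrative version identifiers; bump either one on any upgrade.
EMBED_MODEL = "text-embedding-3-large"
PREPROC_VERSION = "v2"

def embedding_cache_key(text: str) -> str:
    """Namespace every entry by model and preprocessing version so vectors
    from incompatible embedding spaces can never be compared."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{EMBED_MODEL}:{PREPROC_VERSION}:{digest}"
```

A dual-cache migration reads under both the old and new prefixes but writes only the new one, so traffic shifts naturally as the new namespace fills.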
Semantic false positives are the hardest to detect. Short prompts like "hi" or "thanks" score high similarity across unrelated contexts. A question about the return policy for shoes might match a cached answer about jacket returns at 0.82 similarity, serving wrong information. Mitigate with minimum prompt length checks (require 5 to 10 words), metadata alignment (same product category, same tenant, same locale), and a lightweight verifier model that checks whether the cached answer actually addresses the new prompt. Track verifier disagreements as your false positive metric.
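A sketch of that gating logic; the `verify` callable stands in for the lightweight verifier model, and the threshold values are assumptions to tune:

```python
from dataclasses import dataclass
from typing import Callable

MIN_WORDS = 5          # below this, similarity scores are unreliable
SIM_THRESHOLD = 0.92   # tune per model; the 0.82 shoe/jacket match above should fail

@dataclass
class CacheHit:
    answer: str
    tenant: str
    locale: str
    category: str
    similarity: float

def accept_semantic_hit(prompt: str, hit: CacheHit, tenant: str, locale: str,
                        category: str,
                        verify: Callable[[str, str], bool]) -> bool:
    """Apply every gate before serving a semantic-cache hit."""
    if len(prompt.split()) < MIN_WORDS:
        return False                     # short prompt: fall through to the model
    if hit.similarity < SIM_THRESHOLD:
        return False
    if (hit.tenant, hit.locale, hit.category) != (tenant, locale, category):
        return False                     # metadata must align exactly
    # Lightweight verifier: does the cached answer address this prompt?
    # Log rejections here; they are your measured false positive rate.
    return verify(prompt, hit.answer)
```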
Personal data leakage is catastrophic in multi-tenant systems. If the tenant identifier is missing from the cache key, one customer can see another's data. Always partition caches by tenant, encrypt at rest, and use access controls so cross-tenant reads are structurally impossible. For regulated workloads, consider per-tenant cache instances rather than shared infrastructure.
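One way to make the partition structural rather than a convention, sketched over a dict-like store; the backend and key scheme are assumptions:

```python
class TenantCache:
    """A cache handle bound to one tenant at construction time, so a
    cross-tenant read is structurally impossible, not merely discouraged."""

    def __init__(self, store, tenant_id: str):
        self._store = store                      # any dict-like backend (assumed)
        self._prefix = f"tenant:{tenant_id}:"    # partition baked into every key

    def get(self, key: str):
        return self._store.get(self._prefix + key)

    def set(self, key: str, value) -> None:
        self._store[self._prefix + key] = value
```

Request handlers receive a `TenantCache` resolved from the authenticated tenant; no code path ever holds a raw, unpartitioned store.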
💡 Key Takeaways
•Cache stampede on viral prompts or a cold start can cause a 100x backend load spike. Implement single-flight deduplication so one request computes while the others wait, or serve stale entries with background refresh.
•Embedding drift after model upgrades destroys cache utility. A change from 768 to 1536 dimensions or different preprocessing causes a geometry mismatch, collapsing hit rates and spiking false positives. Always namespace cache entries by model version.
•Semantic false positives are highest on short, ambiguous prompts like "hi" or "help". Require a minimum of 5 to 10 words per prompt, enforce metadata alignment (tenant, locale, product category), and run a lightweight verifier to check answer relevance.
•Personal data leakage occurs when the tenant identifier is omitted from the cache key: one customer receives another's cached data. Always partition by tenant, enforce access control in the cache layer, encrypt at rest, and audit for cross-tenant access.
•Stale or unsafe answers can be amplified by result caches. A single hallucination or policy violation gets served repeatedly once cached. Put validators in the write path and require human review for long time-to-live (TTL) entries in sensitive domains (see the sketch after this list).
•Low hit rates on long-tail traffic make global caches ineffective. If 90 percent of prompts are unique, focus on caching embeddings and retrieval results rather than full model responses. Use intent clustering for the head queries.
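For the write-path validators mentioned above, a minimal sketch, assuming a cache client that exposes `set(key, value, ttl=...)` and validator callables that return True when an answer passes:

```python
REVIEW_TTL_CAP = 3600      # seconds; longer-lived entries need human sign-off (assumed policy)

def cache_answer(cache, key, answer, ttl, validators, sensitive=False):
    """Gate the cache write path: a bad answer is rejected once here
    instead of being replayed from cache thousands of times."""
    for passes in validators:            # e.g. policy filter, groundedness check
        if not passes(answer):
            return False                 # never cache an answer that fails a check
    if sensitive and ttl > REVIEW_TTL_CAP:
        ttl = REVIEW_TTL_CAP             # cap TTL until a human approves the entry
    cache.set(key, answer, ttl=ttl)      # assumed client API: set(key, value, ttl=...)
    return True
```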
📌 Examples
A chat system experiences cache stampede during a product launch. A new FAQ prompt receives 8,000 concurrent requests, all of which miss the cache and overload the model cluster. Adding single-flight deduplication limits the backend to one request while the other 7,999 wait, resolving in 2 seconds instead of timing out.
After upgrading from text-embedding-ada-002 to text-embedding-3-large, a RAG system without version namespacing sees its semantic cache hit rate drop from 22 to 4 percent and its false positive rate jump to 15 percent. Adding the model version to cache keys isolates old and new entries, restoring performance.
A multi-tenant support bot accidentally omits the tenant identifier from the cache key. Customer A's question about premium features returns Customer B's cached answer, which contains account details. The post-incident audit adds mandatory tenant partitioning and access control.