Semantic Caching and Retrieval Invalidation
Semantic caching delivers dramatic speedups and cost savings when queries repeat or cluster semantically. Instead of exact string matching, a semantic cache embeds incoming queries and returns a prior answer when the embedding distance falls below a threshold (commonly a cosine similarity of 0.85 to 0.95). Production systems report up to 17× speedups when prompts repeat semantically, turning a 2-second generation into a near-instant cache hit under 100 milliseconds. This directly reduces cost per 1,000 tokens by avoiding repeated Large Language Model (LLM) inference and retrieval computation.
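As a minimal sketch of the hit/miss decision, assuming an in-memory list of (embedding, answer) pairs and a hypothetical `semantic_lookup` helper; the 0.90 threshold is illustrative rather than a recommendation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query_emb: np.ndarray,
                    cache: list[tuple[np.ndarray, str]],
                    threshold: float = 0.90) -> str | None:
    """Return the cached answer most similar to the query, but only if that
    similarity clears the threshold; otherwise signal a miss with None."""
    best_sim, best_answer = -1.0, None
    for cached_emb, cached_answer in cache:
        sim = cosine_similarity(query_emb, cached_emb)
        if sim > best_sim:
            best_sim, best_answer = sim, cached_answer
    return best_answer if best_sim >= threshold else None  # None -> run the full LLM call
```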
The trade-off is freshness versus hit rate. Cached answers can propagate outdated information when the underlying source documents or indexes change. The mitigation is drift-aware invalidation combined with time-to-live (TTL) policies tuned per domain. High-volatility domains like news or pricing use short TTLs (minutes to hours), while stable domains like documentation or historical question answering use longer TTLs (days to weeks). When the retrieval index refreshes or source documents update, invalidate the affected cache entries by tracking document identifiers or embedding clusters. Netflix and Airbnb use this pattern to balance cache efficiency with data freshness, monitoring cache hit rate and staleness incidents as key metrics.
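A sketch of domain-tuned TTLs and document-based invalidation, assuming each cache entry carries a `created_at` timestamp and the set of `source_doc_ids` it was generated from; the domain names, durations, and helper names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-domain TTLs: volatile content expires quickly, stable content slowly.
TTL_BY_DOMAIN = {
    "news": timedelta(minutes=15),
    "pricing": timedelta(hours=1),
    "documentation": timedelta(days=7),
    "historical_qa": timedelta(days=14),
}

def is_fresh(entry: dict, domain: str) -> bool:
    """Check whether a cache entry is still within its domain's TTL."""
    ttl = TTL_BY_DOMAIN.get(domain, timedelta(hours=24))
    return datetime.now(timezone.utc) - entry["created_at"] < ttl

def invalidate_by_documents(cache: dict, updated_doc_ids: set[str]) -> int:
    """Drop every cache entry that cites a document touched by the index refresh."""
    stale_keys = [k for k, e in cache.items() if e["source_doc_ids"] & updated_doc_ids]
    for k in stale_keys:
        del cache[k]
    return len(stale_keys)
```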
Implementation requires storing query embeddings, generated responses, and metadata (timestamp, source document identifiers, model version) in a low-latency key-value store. On each incoming query, compute the embedding (2 to 10 milliseconds), perform an Approximate Nearest Neighbor (ANN) lookup in the cache (sub-millisecond to 10 milliseconds), and return the cached response if the distance is below the threshold and the TTL has not expired. Otherwise, proceed with full retrieval and generation, then insert the new result into the cache. Monitor cache hit rate, average retrieval time on cache miss, and staleness rate (the fraction of cache hits later flagged as outdated by user feedback or drift detection).
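One way this lookup/insert path might look, assuming FAISS as the vector index and placeholder `embed` and `retrieve_and_generate` callables; the dimension, threshold, and store layout are assumptions, not a specific product's schema. A flat inner-product index is used for brevity; production systems would typically swap in an ANN index (HNSW, IVF) at scale:

```python
import time
import numpy as np
import faiss  # assumed vector-index backend; any index with add/search works

DIM, THRESHOLD = 768, 0.90
index = faiss.IndexFlatIP(DIM)   # inner product == cosine on normalized vectors
entries: list[dict] = []         # metadata store, one entry per index row

def _normalize(vec: np.ndarray) -> np.ndarray:
    v = vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(v)
    return v

def answer(query: str, embed, retrieve_and_generate, ttl_seconds: int = 3600) -> str:
    q = _normalize(embed(query))                 # embedding: ~2-10 ms in practice
    if index.ntotal > 0:
        sims, ids = index.search(q, 1)           # nearest-neighbor lookup: sub-ms to ~10 ms
        entry = entries[ids[0][0]]
        fresh = time.time() - entry["ts"] < ttl_seconds
        if sims[0][0] >= THRESHOLD and fresh:
            return entry["response"]             # cache hit
    response = retrieve_and_generate(query)      # cache miss: full retrieval + generation
    index.add(q)
    entries.append({"response": response, "ts": time.time()})
    return response
```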
Failure modes include semantic drift in cached answers and cache pollution. When a vendor model update changes output style or refusal behavior, cached answers from the old model persist until their TTL expires, creating an inconsistent user experience. Cache pollution occurs when low-quality or incorrect answers get cached and propagate errors until manual invalidation. The mitigation is versioning cache entries by model identifier and prompt template hash, automatically invalidating old entries when the model or template changes. Additionally, track user feedback signals (thumbs-down, re-ask rate) per cache hit to detect and purge low-quality cached responses proactively.
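A sketch of the versioning and feedback-driven purge logic; the `MODEL_ID`, prompt template, and feedback-dictionary shape are illustrative assumptions for this example:

```python
import hashlib

MODEL_ID = "vendor-model-v2"  # illustrative identifier
PROMPT_TEMPLATE = "Answer using the context:\n{context}\n\nQuestion: {question}"

def cache_namespace(model_id: str, prompt_template: str) -> str:
    """Namespace cache keys by model and prompt-template hash so that a model
    or template change makes old entries unreachable automatically."""
    template_hash = hashlib.sha256(prompt_template.encode()).hexdigest()[:12]
    return f"{model_id}:{template_hash}"

def purge_low_quality(cache: dict, feedback: dict, max_negative_rate: float = 0.10) -> list[str]:
    """Drop cached answers whose thumbs-down / re-ask rate exceeds the cutoff."""
    purged = []
    for key, stats in feedback.items():
        hits = stats.get("hits", 0)
        if hits and stats.get("negative", 0) / hits > max_negative_rate:
            cache.pop(key, None)
            purged.append(key)
    return purged
```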
💡 Key Takeaways
• Semantic caching with an embedding similarity threshold (commonly cosine 0.85 to 0.95) delivers up to 17× speedups, reducing a 2-second generation to an under-100-millisecond cache hit
• Trade-off between hit rate and freshness: high-volatility domains (news, pricing) use short TTLs (minutes to hours), stable domains (docs, historical Q&A) use longer TTLs (days to weeks)
• Embed the incoming query (2 to 10 milliseconds), perform an ANN lookup in the cache (sub-millisecond to 10 milliseconds), return the cached response if the distance is below threshold and the TTL is valid, otherwise proceed with the full pipeline
• Version cache entries by model identifier and prompt template hash to automatically invalidate old entries on model or template changes, preventing the inconsistent user experience caused by semantic drift
• Monitor cache hit rate, staleness rate (fraction of hits flagged outdated by user feedback), and re-ask rate per cache hit to detect and purge low-quality cached responses proactively
📌 Examples
Netflix recommendation explanations: a semantic cache with a 0.90 similarity threshold achieved a 65 percent hit rate on similar queries, reducing LLM inference cost by $40K per month
Airbnb search assistant: 24-hour TTL for pricing queries, 7-day TTL for neighborhood descriptions, invalidation on index refresh; staleness rate under 2 percent with drift detection integration
Uber customer support: a vendor model update from v1 to v2 changed refusal style; cached v1 answers persisted for 12 hours until TTL expired, creating inconsistent responses, fixed by versioning cache keys with the model identifier
Meta content moderation: cache pollution from false positives in an early rollout; thumbs-down rate was tracked per cache hit, and entries with negative feedback above 10 percent were purged within 1 hour