Semantic Caching and Retrieval Invalidation
Semantic caching delivers dramatic speedups and cost savings when queries repeat or cluster semantically. Instead of exact string matching, a semantic cache embeds incoming queries and returns a prior answer when the embedding distance falls below a threshold (commonly a cosine similarity of 0.85 to 0.95). Production systems report up to 17× speedups when prompts repeat semantically, turning a 2-second generation into a near-instant cache hit under 100 milliseconds. This directly reduces cost per 1,000 tokens by avoiding repeated Large Language Model (LLM) inference and retrieval computation.
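As a minimal sketch of the hit/miss decision, assuming an in-memory list of (embedding, answer) pairs and a hypothetical `semantic_lookup` helper; the 0.90 threshold is illustrative rather than a recommendation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query_emb: np.ndarray,
                    cache: list[tuple[np.ndarray, str]],
                    threshold: float = 0.90) -> str | None:
    """Return the cached answer most similar to the query, but only if that
    similarity clears the threshold; otherwise signal a miss with None."""
    best_sim, best_answer = -1.0, None
    for cached_emb, cached_answer in cache:
        sim = cosine_similarity(query_emb, cached_emb)
        if sim > best_sim:
            best_sim, best_answer = sim, cached_answer
    return best_answer if best_sim >= threshold else None  # None -> run the full LLM call
```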
The trade-off is freshness versus hit rate. Cached answers can propagate outdated information when the underlying source documents or indexes change. The mitigation is drift-aware invalidation combined with time-to-live (TTL) policies tuned per domain. High-volatility domains like news or pricing use short TTLs (minutes to hours), while stable domains like documentation or historical question answering use longer TTLs (days to weeks). When the retrieval index refreshes or source documents update, invalidate the affected cache entries by tracking document identifiers or embedding clusters. Netflix and Airbnb use this pattern to balance cache efficiency with data freshness, monitoring cache hit rate and staleness incidents as key metrics.
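A sketch of domain-tuned TTLs and document-based invalidation, assuming each cache entry carries a `created_at` timestamp and the set of `source_doc_ids` it was generated from; the domain names, durations, and helper names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-domain TTLs: volatile content expires quickly, stable content slowly.
TTL_BY_DOMAIN = {
    "news": timedelta(minutes=15),
    "pricing": timedelta(hours=1),
    "documentation": timedelta(days=7),
    "historical_qa": timedelta(days=14),
}

def is_fresh(entry: dict, domain: str) -> bool:
    """Check whether a cache entry is still within its domain's TTL."""
    ttl = TTL_BY_DOMAIN.get(domain, timedelta(hours=24))
    return datetime.now(timezone.utc) - entry["created_at"] < ttl

def invalidate_by_documents(cache: dict, updated_doc_ids: set[str]) -> int:
    """Drop every cache entry that cites a document touched by the index refresh."""
    stale_keys = [k for k, e in cache.items() if e["source_doc_ids"] & updated_doc_ids]
    for k in stale_keys:
        del cache[k]
    return len(stale_keys)
```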
Implementation requires storing query embeddings, generated responses, and metadata (timestamp, source document identifiers, model version) in a low-latency key-value store. On each incoming query, compute the embedding (2 to 10 milliseconds), perform an Approximate Nearest Neighbor (ANN) lookup in the cache (sub-millisecond to 10 milliseconds), and return the cached response if the distance is below the threshold and the TTL has not expired. Otherwise, proceed with full retrieval and generation, then insert the new result into the cache. Monitor cache hit rate, average retrieval time on cache miss, and staleness rate (the fraction of cache hits later flagged as outdated by user feedback or drift detection).
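One way this lookup/insert path might look, assuming FAISS as the vector index and placeholder `embed` and `retrieve_and_generate` callables; the dimension, threshold, and store layout are assumptions, not a specific product's schema. A flat inner-product index is used for brevity; production systems would typically swap in an ANN index (HNSW, IVF) at scale:

```python
import time
import numpy as np
import faiss  # assumed vector-index backend; any index with add/search works

DIM, THRESHOLD = 768, 0.90
index = faiss.IndexFlatIP(DIM)   # inner product == cosine on normalized vectors
entries: list[dict] = []         # metadata store, one entry per index row

def _normalize(vec: np.ndarray) -> np.ndarray:
    v = vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(v)
    return v

def answer(query: str, embed, retrieve_and_generate, ttl_seconds: int = 3600) -> str:
    q = _normalize(embed(query))                 # embedding: ~2-10 ms in practice
    if index.ntotal > 0:
        sims, ids = index.search(q, 1)           # nearest-neighbor lookup: sub-ms to ~10 ms
        entry = entries[ids[0][0]]
        fresh = time.time() - entry["ts"] < ttl_seconds
        if sims[0][0] >= THRESHOLD and fresh:
            return entry["response"]             # cache hit
    response = retrieve_and_generate(query)      # cache miss: full retrieval + generation
    index.add(q)
    entries.append({"response": response, "ts": time.time()})
    return response
```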
Failure modes include semantic drift in cached answers and cache pollution. When a vendor model update changes output style or refusal behavior, cached answers from the old model persist until their TTL expires, creating an inconsistent user experience. Cache pollution occurs when low-quality or incorrect answers get cached and propagate errors until manual invalidation. The mitigation is versioning cache entries by model identifier and prompt template hash, automatically invalidating old entries when the model or template changes. Additionally, track user feedback signals (thumbs-down, re-ask rate) per cache hit to detect and purge low-quality cached responses proactively.
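A sketch of the versioning and feedback-driven purge logic; the `MODEL_ID`, prompt template, and feedback-dictionary shape are illustrative assumptions for this example:

```python
import hashlib

MODEL_ID = "vendor-model-v2"  # illustrative identifier
PROMPT_TEMPLATE = "Answer using the context:\n{context}\n\nQuestion: {question}"

def cache_namespace(model_id: str, prompt_template: str) -> str:
    """Namespace cache keys by model and prompt-template hash so that a model
    or template change makes old entries unreachable automatically."""
    template_hash = hashlib.sha256(prompt_template.encode()).hexdigest()[:12]
    return f"{model_id}:{template_hash}"

def purge_low_quality(cache: dict, feedback: dict, max_negative_rate: float = 0.10) -> list[str]:
    """Drop cached answers whose thumbs-down / re-ask rate exceeds the cutoff."""
    purged = []
    for key, stats in feedback.items():
        hits = stats.get("hits", 0)
        if hits and stats.get("negative", 0) / hits > max_negative_rate:
            cache.pop(key, None)
            purged.append(key)
    return purged
```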
💡 Key Takeaways
• Semantic caching with an embedding similarity threshold (commonly cosine 0.85 to 0.95) delivers up to 17× speedups, reducing a 2-second generation to an under-100-millisecond cache hit
• Trade-off between hit rate and freshness: high-volatility domains (news, pricing) use short TTLs (minutes to hours), stable domains (docs, historical Q&A) use longer TTLs (days to weeks)
• Embed the incoming query (2 to 10 milliseconds), perform an ANN lookup in the cache (sub-millisecond to 10 milliseconds), return the cached response if the distance is below threshold and the TTL is valid, otherwise proceed with the full pipeline
• Version cache entries by model identifier and prompt template hash to automatically invalidate old entries on model or template changes, preventing the inconsistent user experience caused by semantic drift
• Monitor cache hit rate, staleness rate (fraction of hits flagged outdated by user feedback), and re-ask rate per cache hit to detect and purge low-quality cached responses proactively
📌 Examples
Netflix recommendation explanations: a semantic cache with a 0.90 similarity threshold achieved a 65 percent hit rate on similar queries, reducing LLM inference cost by $40K per month
Airbnb search assistant: 24-hour TTL for pricing queries, 7-day TTL for neighborhood descriptions, invalidation on index refresh; staleness rate under 2 percent with drift detection integration
Uber customer support: a vendor model update from v1 to v2 changed refusal style; cached v1 answers persisted for 12 hours until TTL expired, creating inconsistent responses, fixed by versioning cache keys with the model identifier
Meta content moderation: cache pollution from false positives in an early rollout; thumbs-down rate was tracked per cache hit, and entries with negative feedback above 10 percent were purged within 1 hour