Semantic Caching and Retrieval Invalidation
How Semantic Caching Works
Semantic caching delivers dramatic speedups and cost savings when queries repeat or cluster semantically. Instead of exact string matching, a semantic cache embeds each incoming query and returns a prior answer when the embedding similarity exceeds a threshold (commonly a cosine similarity of 0.85 to 0.95). Production systems report speedups of up to 17x when prompts repeat semantically, turning a 2-second generation into a sub-100ms cache hit. This directly reduces cost per 1,000 tokens by avoiding repeated LLM inference.
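The core lookup can be sketched in a few lines of Python. The embeddings, cached answer, and 0.90 threshold below are illustrative stand-ins; a real system would call an embedding model rather than hard-code vectors:

```python
import math

SIMILARITY_THRESHOLD = 0.90  # commonly tuned between 0.85 and 0.95

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy cache: (query embedding, cached response) pairs.
cache = [
    ([0.9, 0.1, 0.0], "Paris is the capital of France."),
]

def lookup(query_embedding):
    """Return the cached response for the nearest entry, or None on a miss."""
    best = max(cache, key=lambda entry: cosine_similarity(entry[0], query_embedding))
    if cosine_similarity(best[0], query_embedding) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None
```

A near-duplicate query embedding such as `[0.88, 0.12, 0.01]` clears the threshold and hits the cache, while an unrelated one such as `[0.0, 0.1, 0.99]` falls through to full generation.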
Freshness vs Hit Rate
Cached answers can propagate outdated information when the underlying source documents or indexes change. The mitigation is drift-aware invalidation combined with TTL policies tuned per domain: high-volatility domains such as news or pricing use short TTLs (minutes to hours), while stable domains such as documentation use longer TTLs (days to weeks). When the retrieval index refreshes or source documents update, invalidate the affected cache entries by tracking document identifiers or embedding clusters.
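A minimal sketch of both mechanisms, assuming per-domain TTLs and per-entry source-document tracking (the domain names, TTL values, and dictionary layout are hypothetical):

```python
import time

# Hypothetical TTLs tuned by domain volatility, in seconds.
DOMAIN_TTL = {
    "news": 15 * 60,         # minutes-scale for high-volatility content
    "pricing": 60 * 60,      # one hour
    "docs": 7 * 24 * 3600,   # about a week for stable documentation
}

cache = {}  # key -> {"response", "created_at", "ttl", "source_doc_ids"}

def put(key, response, domain, source_doc_ids):
    cache[key] = {
        "response": response,
        "created_at": time.time(),
        "ttl": DOMAIN_TTL[domain],
        "source_doc_ids": set(source_doc_ids),
    }

def get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    if time.time() - entry["created_at"] > entry["ttl"]:
        del cache[key]  # expired: force a fresh retrieval + generation
        return None
    return entry["response"]

def invalidate_by_document(doc_id):
    """Drop every cache entry derived from an updated source document."""
    stale = [k for k, e in cache.items() if doc_id in e["source_doc_ids"]]
    for k in stale:
        del cache[k]
    return len(stale)
```

On an index refresh, the pipeline calls `invalidate_by_document` for each changed document, so only entries derived from stale sources are evicted rather than the whole cache.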
Implementation Pattern
Store query embeddings, generated responses, and metadata (timestamp, source document identifiers, model version) in a low-latency key-value store. On each incoming query, compute the embedding (2 to 10 ms), perform an ANN lookup against the cache (sub-millisecond to 10 ms), and return the cached response if the similarity clears the threshold and the TTL has not expired. Otherwise, run full retrieval and generation, then insert the new result into the cache.
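The full read-through pattern can be sketched as below. The `embed` and `generate` functions are stubs standing in for the embedding model and the retrieval-plus-LLM pipeline, and a brute-force scan stands in for a real ANN index:

```python
import math
import time

THRESHOLD = 0.90
TTL_SECONDS = 3600

_cache = []  # entries: (embedding, response, metadata); brute-force ANN stand-in

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embed(query):
    # Stub: a real system calls an embedding model here.
    return [float(len(query)), query.count(" ") + 1.0, 1.0]

def generate(query):
    # Stub: a real system runs full retrieval and LLM generation here.
    return f"answer({query})"

def answer(query):
    q_emb = embed(query)
    now = time.time()
    # ANN lookup stand-in: scan for a non-expired entry above threshold.
    for emb, response, meta in _cache:
        if now - meta["ts"] <= TTL_SECONDS and _cos(emb, q_emb) >= THRESHOLD:
            return response, "hit"
    # Cache miss: do the expensive work, then insert with metadata.
    response = generate(query)
    _cache.append((q_emb, response, {"ts": now, "model": "model-v1", "doc_ids": []}))
    return response, "miss"
```

The first call for a query is a miss that pays full generation cost; a semantically identical follow-up returns the cached response without touching the LLM.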
Cache Pollution Failure
When a vendor model update changes output style or refusal behavior, cached answers from the old model persist until their TTL expires, producing an inconsistent user experience. Cache pollution occurs when low-quality or incorrect answers are cached and propagate errors until someone invalidates them manually. The mitigation is to version cache entries by model identifier and prompt template hash, automatically invalidating old entries whenever the model or template changes.
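One way to sketch that versioning scheme, assuming a key layout of (model id, template hash, query key); the names are illustrative, not prescribed by the text:

```python
import hashlib

def template_hash(template: str) -> str:
    """Stable short hash of the prompt template text."""
    return hashlib.sha256(template.encode()).hexdigest()[:12]

def versioned_key(query_key: str, model_id: str, template: str) -> tuple:
    # Key entries by (model id, template hash, query). Any model or
    # template change produces fresh keys, so stale entries can never
    # be served again and can be garbage-collected in bulk.
    return (model_id, template_hash(template), query_key)

cache = {}

def put(query_key, response, model_id, template):
    cache[versioned_key(query_key, model_id, template)] = response

def get(query_key, model_id, template):
    return cache.get(versioned_key(query_key, model_id, template))
```

An entry written under `"model-v1"` is simply invisible to lookups issued under `"model-v2"`, so a vendor model upgrade invalidates old answers without an explicit purge step.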