
Semantic Result Cache: Architecture and Similarity Thresholds

Semantic result caching goes beyond exact key matching by using embedding similarity to reuse answers when user intent matches, even if wording differs. This dramatically increases hit rates but introduces correctness risks that require careful tuning. The architecture involves computing or looking up the embedding for an incoming prompt, then running approximate nearest neighbor search across a cache of previous prompts and their responses. With HNSW-style indexes and 768 to 1536 dimensional vectors, p95 search latency stays at 5 to 20 milliseconds even with 10 to 100 million cached prompts on commodity CPUs.

The critical parameter is the cosine similarity threshold. Closed domains like internal enterprise tools often require 0.85 to 0.95 similarity to ensure the cached answer actually addresses the new question. Open-domain systems might relax to 0.7, accepting higher false positive rates for better hit rates. The tradeoff is hit rate versus accuracy. Loose thresholds like 0.7 can push combined exact-plus-semantic hit rates to 40 or 50 percent but risk serving wrong answers: a customer asking about the return policy for electronics should not receive a cached answer about clothing returns, even if both questions score 0.75 similarity. Tight thresholds like 0.9 cut false positives but reduce the benefit to only 5 to 10 percent additional hits beyond the exact cache.

Production systems often start conservative at 0.9, measure false positive rates using a small verifier model, then gradually relax the threshold if quality holds. Google and Bing cache popular search queries with semantic matching to shave tens of milliseconds at web scale, using second-level Time To Live (TTL) values to balance freshness and hit rate. The key insight is that semantic caching is a latency and cost optimization, not a correctness feature. Always enforce metadata alignment (same tenant, same locale, same tool configuration) and a minimum prompt length to avoid collisions on trivial inputs like "hi" or "thanks".
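To make the lookup path concrete, here is a minimal sketch in Python, assuming hnswlib for the ANN index and a caller-supplied embedding function (for example, a sentence-transformers model). The `SemanticCache` class, its default parameters, and the exact-match check on a metadata dict are illustrative choices for this sketch, not a specific production design.

```python
import time

import hnswlib
import numpy as np


class SemanticCache:
    """Illustrative semantic result cache: ANN search over prompt embeddings."""

    def __init__(self, dim: int, threshold: float = 0.9,
                 min_words: int = 5, max_items: int = 1_000_000):
        self.threshold = threshold      # cosine similarity floor for a hit
        self.min_words = min_words      # reject trivial prompts like "hi"
        # hnswlib's "cosine" space reports distance = 1 - cosine similarity.
        self.index = hnswlib.Index(space="cosine", dim=dim)
        self.index.init_index(max_elements=max_items, ef_construction=200, M=16)
        self.index.set_ef(64)           # search-time recall/latency knob
        self.entries = {}               # label -> (response, metadata, expires_at)
        self.count = 0

    def _eligible(self, prompt: str) -> bool:
        # Minimum-length guard against semantic collisions on trivial inputs.
        return len(prompt.split()) >= self.min_words

    def get(self, prompt: str, embedding: np.ndarray, metadata: dict):
        if not self._eligible(prompt) or self.count == 0:
            return None
        k = min(5, self.count)
        labels, distances = self.index.knn_query(embedding, k=k)
        for label, dist in zip(labels[0], distances[0]):
            similarity = 1.0 - dist
            if similarity < self.threshold:
                break  # neighbors are sorted; nothing closer follows
            response, meta, expires_at = self.entries[label]
            # Metadata alignment: tenant, locale, and tool config must match,
            # and the entry must still be within its TTL.
            if meta == metadata and time.time() < expires_at:
                return response
        return None

    def put(self, prompt: str, embedding: np.ndarray, metadata: dict,
            response: str, ttl_seconds: float = 3600.0):
        if not self._eligible(prompt):
            return
        self.index.add_items(embedding.reshape(1, -1), [self.count])
        self.entries[self.count] = (response, metadata, time.time() + ttl_seconds)
        self.count += 1
```

Relaxing the threshold is a one-line change here; what makes relaxing it safe is the verifier-based false positive measurement described in the takeaways below.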
💡 Key Takeaways
Cosine similarity thresholds control the tradeoff between hit rate and false positives. Enterprise tools use 0.85 to 0.95 in closed domains for safety; open-domain systems might use 0.7, accepting higher error rates in exchange for 40 to 50 percent total hit rates.
HNSW indexes enable approximate nearest neighbor search at 5 to 20 milliseconds p95 latency across 10 to 100 million cached prompts on CPUs. Index build time and memory are the primary cost, not query latency at this scale.
Metadata alignment is mandatory to prevent cross-contamination. Include the tenant identifier, locale, tool configuration hash, and safety settings in the matching logic beyond just vector similarity, so an electronics return policy is never served for a clothing question.
Minimum prompt length checks prevent collisions on ambiguous short inputs. Phrases like "hi", "thanks", or "help" often score high similarity but have wildly different contexts. Require at least 5 to 10 words or use a domain classifier.
Verifier models measure false positive rates in production. Run a lightweight model to check whether the cached answer actually addresses the new prompt. Track disagreements as your false positive metric and use it to tune thresholds; a minimal sampling sketch follows after these takeaways.
Time To Live (TTL) values balance freshness and savings. Static content like FAQs can cache for weeks. Prices or inventory need a 30 to 120 second TTL. Google and Bing use second-level TTLs for popular queries to absorb web-scale traffic spikes.
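As referenced in the verifier takeaway above, here is a hedged sketch of how false positive measurement might look. The `verifier_says_match` callable is a hypothetical stand-in for a lightweight model call (for example, a small LLM asked whether the answer addresses the question), and the 5 percent sampling rate is an arbitrary example value chosen to keep audit cost low.

```python
import random


def serve_cache_hit(prompt, cached_response, verifier_says_match,
                    stats, sample_rate=0.05):
    """Serve the hit immediately, but audit a small sample with a verifier.

    verifier_says_match(prompt, response) is a hypothetical callable that
    asks a lightweight model whether the cached response actually addresses
    the new prompt. Disagreements become the false positive metric.
    """
    if random.random() < sample_rate:
        stats["checked"] += 1
        if not verifier_says_match(prompt, cached_response):
            stats["false_positives"] += 1
    return cached_response


stats = {"checked": 0, "false_positives": 0}
# Once enough hits are audited, false_positives / checked estimates the
# false positive rate; tighten the similarity threshold if it climbs.
```

In a real deployment the verifier call would run asynchronously off the serving path; it is shown inline here only to keep the sketch self-contained.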
📌 Examples
An enterprise chatbot with a 0.9 similarity threshold sees a 15 percent semantic hit rate beyond 25 percent exact hits, saving $40K monthly in API costs at 10 million requests. Lowering the threshold to 0.8 pushes total hits to 50 percent but introduces an 8 percent false positive rate detected by the verifier.
Netflix caches recommendation explanations for semantically similar prompts about why a show was recommended. With a 0.88 threshold and genre-plus-rating metadata alignment, it achieves 22 percent additional hits with under 2 percent user-reported mismatches.
A support system caching 50 million FAQ prompts uses two-tier lookup: exact hash first (20ms p50), then HNSW search (12ms p95). A combined hit rate of 38 percent eliminates 4 out of 10 model calls, reducing p95 end-to-end latency from 1.8 seconds to 800 milliseconds for cache hits.
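The two-tier pattern from this last example can be sketched as follows. The normalization step, the dict-like `exact_cache`, and the `semantic_cache` object (with the interface from the earlier sketch) are assumptions made for illustration.

```python
import hashlib


def lookup(prompt: str, metadata: dict, embed, exact_cache: dict, semantic_cache):
    """Tier 1: exact hash on normalized prompt + metadata. Tier 2: ANN search."""
    normalized = " ".join(prompt.lower().split())
    key = hashlib.sha256(
        (normalized + "|" + repr(sorted(metadata.items()))).encode()
    ).hexdigest()
    hit = exact_cache.get(key)          # cheap path, no embedding needed
    if hit is not None:
        return hit
    # Only pay for an embedding when the exact tier misses.
    hit = semantic_cache.get(prompt, embed(prompt), metadata)
    if hit is not None:
        exact_cache[key] = hit          # promote so exact repeats stay cheap
    return hit                          # None means a full miss: call the model
```

Promoting semantic hits into the exact tier is one reasonable design choice here: repeat traffic for the same phrasing then bypasses the embedding and ANN search entirely.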