Production Scale Caching Implementation
Building the Cache Key:
The cache key determines what makes two requests "the same." A naive implementation might just hash the user prompt, but this breaks immediately in production. You need to capture, and where appropriate normalize, every input that affects model output.
A proper cache key includes: the user prompt with whitespace and formatting normalized, the complete system prompt or instructions, model name and version (because gpt-4-0613 and gpt-4-1106 behave differently), temperature and other sampling parameters like top_p and max_tokens, any tool or function definitions if using function calling, and potentially user or tenant ID if responses are personalized.
Missing any of these causes cache pollution. Imagine caching a response generated with temperature=0.7 and serving it for a request with temperature=0. The deterministic request gets a potentially creative response, violating user expectations.
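To make this concrete, here is a minimal sketch of how such a key might be assembled. The function and field names are illustrative rather than any particular provider's API; the point is that every output-affecting parameter ends up in the hash.

```python
import hashlib
import json


def build_cache_key(prompt, system_prompt, model, temperature,
                    top_p=1.0, max_tokens=None, tools=None, tenant_id=None):
    """Hash every input that can change the model's output into one key."""
    # Normalize whitespace so trivially different prompts hash identically
    normalized_prompt = " ".join(prompt.split())

    key_material = {
        "prompt": normalized_prompt,
        "system": system_prompt,
        "model": model,            # exact version, e.g. "gpt-4-0613", never an alias
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "tools": tools,            # tool or function definitions, if any
        "tenant": tenant_id,       # include only if responses are personalized
    }
    # sort_keys gives a stable serialization regardless of insertion order
    serialized = json.dumps(key_material, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

How aggressively to normalize the prompt (lowercasing, stripping punctuation) is a product decision; what is not negotiable is that anything capable of changing the model's answer stays in the key.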
Time to Live (TTL) Strategy:
TTL selection directly trades freshness for hit rate. For static content like product documentation, you might use a 24-hour TTL and invalidate explicitly when docs are updated by including a document version hash in the cache key. For dynamic content like stock prices, TTL might be 30 seconds or even lower.
A common pattern is tiered TTLs based on query type. FAQ responses get a 6-hour TTL. Personalized recommendations get a 5-minute TTL. Real-time data queries get a 10-second TTL. Each tier balances staleness risk against cost savings.
⚠️ Common Pitfall: Setting TTL too high without versioning leads to serving stale answers after content updates. Always tie cache invalidation to your content deployment pipeline or include content version in the key.
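A rough sketch of how tiered TTLs and content-versioned keys might look in code; the tier names, TTL values, and the redis-py-style set call are illustrative assumptions, not a prescribed setup.

```python
# Illustrative TTL tiers in seconds; tune per use case
TTL_BY_QUERY_TYPE = {
    "faq": 6 * 3600,           # 6 hours
    "recommendation": 5 * 60,  # 5 minutes
    "realtime": 10,            # 10 seconds
}
DEFAULT_TTL_SECONDS = 60


def store_with_ttl(cache, key, response, query_type):
    """Pick a TTL by query type and store the response (redis-py-style set)."""
    ttl = TTL_BY_QUERY_TYPE.get(query_type, DEFAULT_TTL_SECONDS)
    cache.set(key, response, ex=ttl)


def versioned_key(base_key, doc_version):
    # Folding the content version into the key means a docs deploy
    # naturally invalidates stale entries without an explicit purge.
    return f"{base_key}:{doc_version}"
```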
Semantic Caching Details:
Semantic caching uses embedding similarity to match prompts that are conceptually similar but textually different. "How do I reset my password?" and "I forgot my password, help" should return the same cached response.
Implementation stores embeddings (typically 768 or 1536 dimensions from models like text-embedding-ada-002) in a vector database. At query time, you embed the incoming prompt, search for nearest neighbors, and check if the top match exceeds your similarity threshold (often 0.85 to 0.95 cosine similarity).
The threshold is critical. Too low (0.7) and you risk false positives where subtly different questions get the wrong cached answer. Too high (0.98) and hit rate drops to nearly zero because only nearly identical prompts match. You must tune this per use case and monitor precision.
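A simplified sketch of the lookup path, assuming an embed function that returns a vector and a plain list of (embedding, response) pairs standing in for a real vector database:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.90  # tune per use case; 0.85-0.95 is a typical range


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_lookup(prompt, entries, embed):
    """entries: list of (embedding, cached_response) pairs; embed: text -> vector."""
    query_vec = embed(prompt)
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Serve the cached answer only if the nearest neighbor clears the threshold
    if best_score >= SIMILARITY_THRESHOLD:
        return best_response
    return None  # miss: call the LLM, then store (embed(prompt), fresh_answer)
```

In production the linear scan would be a nearest-neighbor query against the vector database, but the thresholding decision at the end is the same.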
Semantic Caching Threshold Trade-off: a threshold near 0.7 yields high hit rates but low precision, while a threshold near 0.95 yields high precision but low hit rates.
Observability and Metrics:
You cannot optimize what you cannot measure. Track these metrics per caching layer: cache hit rate (hits divided by total requests), cost saved per 1,000 requests (hits times average LLM cost), latency percentiles split by cache hit versus miss, token usage broken down by cached versus fresh, and staleness incidents (when cached responses were detectably wrong).
Set up alerts for sudden hit rate drops (cache failure or query pattern shift), p99 latency spikes (cache overload or provider issues), and cost anomalies (cache bypass due to bugs). A 10 percent drop in hit rate can mean tens of thousands of dollars in extra daily costs at scale.
Rollout Strategy:
Start with exact prompt caching only, measure baseline hit rate for 1 to 2 weeks, then add semantic caching for high volume intents where approximate matches are safe. Use feature flags to control which query types use which caching strategy. Roll out gradually: 5 percent of traffic, then 25 percent, then 100 percent, monitoring quality metrics at each stage.
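As a rough sketch of the per-layer bookkeeping behind the observability metrics above, assuming a hypothetical CacheLayerMetrics helper and an illustrative per-call LLM cost:

```python
from dataclasses import dataclass, field

AVG_LLM_COST_PER_CALL_USD = 0.01  # illustrative; plug in your real per-request cost


@dataclass
class CacheLayerMetrics:
    hits: int = 0
    misses: int = 0
    staleness_incidents: int = 0
    hit_latencies_ms: list = field(default_factory=list)
    miss_latencies_ms: list = field(default_factory=list)

    def record(self, hit, latency_ms):
        """Call once per request served through this caching layer."""
        if hit:
            self.hits += 1
            self.hit_latencies_ms.append(latency_ms)
        else:
            self.misses += 1
            self.miss_latencies_ms.append(latency_ms)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_saved_per_1k_requests(self):
        # Each hit avoids one LLM call
        return 1000 * self.hit_rate * AVG_LLM_COST_PER_CALL_USD
```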
💡 Key Takeaways
✓ Cache keys must include all determinism factors: prompt, system instructions, model version, temperature, and sampling parameters to avoid cache pollution
✓ TTL strategy varies by use case: 24 hours for static docs, 5 minutes for personalized content, 10 seconds for real-time data
✓ Semantic caching threshold determines the trade-off between hit rate and precision: 0.85 to 0.95 cosine similarity is typical, requiring careful tuning per domain
✓ Production observability requires tracking hit rate, cost saved, latency split by hit/miss, and staleness incidents, with alerts on 10 percent hit rate drops
📌 Examples
1. E-commerce product question cache: key includes product_id and doc_version, 12-hour TTL, invalidate on product update, achieving a 35% hit rate and saving $4K daily
2. Financial chatbot: semantic caching for earnings questions with a 0.90 threshold and a 45% hit rate, but disabled for regulatory queries requiring exact answers
3. Internal HR tool: exact caching for policy FAQs with policy_version in the key, 24-hour TTL bumped to 1 week after stabilization, 60% hit rate