Production Scale Caching Implementation
Building the Cache Key
The cache key determines what makes two requests "the same." A naive implementation might just hash the user prompt, but this breaks immediately in production. You need to normalize all inputs that affect model output.
A proper cache key includes:
- the user prompt, with whitespace and formatting normalized
- the complete system prompt or instructions
- model name and version (because gpt-4-0613 and gpt-4-1106 behave differently)
- temperature and other sampling parameters such as top_p and max_tokens
- any tool or function definitions, if using function calling
- potentially a user or tenant ID, if responses are personalized
Missing any of these causes cache pollution. Imagine caching a response generated with temperature=0.7 and serving it for a request with temperature=0. The deterministic request gets a potentially creative response, violating user expectations.
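The normalization described above can be sketched as follows. This is an illustrative helper, not a canonical implementation; the parameter names mirror a typical chat-completion request and should be adapted to your own schema.

```python
import hashlib
import json

def make_cache_key(prompt, system_prompt, model, temperature,
                   top_p=1.0, max_tokens=None, tools=None, tenant_id=None):
    """Build a deterministic cache key from every input that affects output.

    Parameter names are illustrative; include whatever fields your
    request schema actually sends to the model.
    """
    normalized_prompt = " ".join(prompt.split())  # collapse whitespace/formatting
    payload = {
        "prompt": normalized_prompt,
        "system": system_prompt,
        "model": model,            # must include the version, e.g. "gpt-4-1106"
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "tools": tools,            # tool/function definitions, if any
        "tenant": tenant_id,       # only if responses are personalized
    }
    # sort_keys gives a stable serialization, so equal inputs hash equally
    blob = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Note that two requests differing only in whitespace produce the same key, while a change to temperature (or any other field) produces a different key, which is exactly the pollution guard described above.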
Time to Live (TTL) Strategy
TTL selection directly trades freshness for hit rate. For static content like product documentation, you might use a 24-hour TTL and invalidate explicitly when docs are updated by including a document version hash in the cache key. For dynamic content like stock prices, TTL might be 30 seconds or even lower. A common pattern is tiered TTLs based on query type:
- FAQ responses: 6-hour TTL
- personalized recommendations: 5-minute TTL
- real-time data queries: 10-second TTL
Each tier balances staleness risk against cost savings.
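A minimal in-memory sketch of the tiered-TTL pattern might look like this. The tier names and a real production cache (Redis, Memcached) are assumptions; this version is not thread-safe and expires entries lazily on read.

```python
import time

# Illustrative TTL tiers (in seconds), mirroring the tiers in the text
TTL_TIERS = {
    "faq": 6 * 3600,            # 6 hours
    "recommendation": 5 * 60,   # 5 minutes
    "realtime": 10,             # 10 seconds
}

class TTLCache:
    """Minimal in-memory cache with per-entry TTL (sketch only)."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value, query_type):
        ttl = TTL_TIERS.get(query_type, 60)  # assumed 60s default tier
        self._store[key] = (time.monotonic() + ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry: drop stale entry on read
            return None
        return value
```

In production the same tier table would feed the TTL argument of your cache store's set call rather than a Python dict.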
Semantic Caching Details
Semantic caching uses embedding similarity to match prompts that are conceptually similar but textually different. "How do I reset my password?" and "I forgot my password, help" should return the same cached response.
The implementation stores embeddings (typically 768- or 1536-dimensional vectors from models like text-embedding-ada-002) in a vector database. At query time, you embed the incoming prompt, search for nearest neighbors, and check whether the top match exceeds your similarity threshold (often 0.85 to 0.95 cosine similarity).
The threshold is critical. Too low (0.7) and you risk false positives where subtly different questions get the wrong cached answer. Too high (0.98) and hit rate drops to nearly zero because only nearly identical prompts match. You must tune this per use case and monitor precision.
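The lookup logic can be sketched as below. The embed_fn is any prompt-to-vector function (for example, a call to text-embedding-ada-002); the linear scan stands in for a real vector database, and the 0.9 default threshold is one reasonable point in the 0.85 to 0.95 range discussed above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Sketch of a semantic cache. A real deployment would use a vector
    database for the nearest-neighbor search instead of a linear scan."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn      # prompt -> embedding vector
        self.threshold = threshold    # tune per use case and monitor precision
        self.entries = []             # list of (embedding, cached_response)

    def set(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))

    def get(self, prompt):
        query = self.embed_fn(prompt)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:   # nearest-neighbor scan
            score = cosine_similarity(query, emb)
            if score > best_score:
                best_score, best_response = score, response
        # return a hit only when the best match clears the threshold
        return best_response if best_score >= self.threshold else None
```

Raising the threshold here is exactly the precision/hit-rate dial the text describes: fewer wrong answers served from cache, but fewer hits overall.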
Observability and Metrics
You cannot optimize what you cannot measure. Track these metrics per caching layer:
- cache hit rate (hits divided by total requests)
- cost saved per 1,000 requests (hits times average LLM call cost)
- latency percentiles, split by cache hit versus miss
- token usage, broken down by cached versus fresh
- staleness incidents (cached responses that were detectably wrong)
Set up alerts for sudden hit rate drops (cache failure or query pattern shift), p99 latency spikes (cache overload or provider issues), and cost anomalies (cache bypass due to bugs). A 10 percent drop in hit rate can mean tens of thousands of dollars in extra daily costs at scale.
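The two headline numbers, hit rate and cost saved per 1,000 requests, follow directly from counters you already have. A sketch, with an assumed per-call cost figure:

```python
def cache_metrics(hits, misses, avg_llm_cost_usd):
    """Derive hit rate and cost saved per 1,000 requests.

    avg_llm_cost_usd is the average cost of one uncached LLM call;
    the figure used in any real dashboard must come from your billing data.
    """
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    # each hit avoids one LLM call; scale to a per-1,000-request figure
    cost_saved_per_1k = hit_rate * 1000 * avg_llm_cost_usd
    return {"hit_rate": hit_rate, "cost_saved_per_1k": cost_saved_per_1k}
```

For example, an 80 percent hit rate at $0.01 per call saves roughly $8 per 1,000 requests, which is the kind of baseline you would alert against.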
Rollout Strategy
Start with exact prompt caching only, measure baseline hit rate for 1 to 2 weeks, then add semantic caching for high volume intents where approximate matches are safe. Use feature flags to control which query types use which caching strategy. Roll out gradually: 5 percent of traffic, then 25 percent, then 100 percent, monitoring quality metrics at each stage.
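The gradual percentage ramp above needs deterministic bucketing so a given user stays in or out of the rollout as it widens from 5 to 25 to 100 percent. A sketch using a stable hash (the feature name and the idea of hashing user ID plus feature are illustrative; a real deployment would typically use an existing feature-flag service):

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministically assign a user to a rollout bucket (0-99).

    Hashing user_id together with the feature name keeps buckets stable
    across ramp stages and independent between features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucket is a pure function of the inputs, every user admitted at 5 percent is still admitted at 25 and 100 percent, so quality metrics at each stage compare like with like.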