Failure Modes and Edge Cases in LLM Caching
Staleness and Cache Invalidation
The most common production failure is serving stale cached responses after the underlying data changes. Imagine an HR chatbot caching policy answers with a 24-hour TTL. If the vacation policy changes midday from 15 to 20 days, cached responses will tell employees the old 15-day figure for up to 24 hours, causing confusion and potential compliance issues.
The fix requires tying cache keys to content versions. Instead of just caching by prompt, include a policy_version hash in the key. When policies update, the version changes, and old cache entries become unreachable. This works but requires coordinating your cache invalidation with your content deployment pipeline.
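A minimal sketch of version-keyed caching, assuming a hypothetical `policy_version` string that your content pipeline bumps on every policy deployment (the function names here are illustrative, not from any particular library):

```python
import hashlib
import json

def cache_key(prompt: str, policy_version: str) -> str:
    # The version is part of the key, so a version bump makes every
    # old entry unreachable without an explicit invalidation step.
    payload = json.dumps({"prompt": prompt, "v": policy_version}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}

def cached_answer(prompt, policy_version, generate):
    key = cache_key(prompt, policy_version)
    if key not in cache:
        cache[key] = generate(prompt)  # only invoke the LLM on a miss
    return cache[key]
```

When the policy version moves from, say, "v15" to "v16", every cached prompt hashes to a new key and the first request per prompt regenerates fresh content; stale entries simply age out of storage.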
For dynamic data like stock prices or inventory, aggressive TTLs (10 to 30 seconds) are necessary. But this tanks your hit rate because cache entries expire before they can be reused. You end up with cache overhead (key generation, lookup latency) but minimal benefit.
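For illustration, a bare-bones TTL cache sketch (the 10-to-30-second TTL would be chosen at construction; `time.monotonic` avoids wall-clock jumps):

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)
```

With a 10-second TTL and queries arriving less often than every 10 seconds per key, every lookup lands on the expiry branch, which is exactly the overhead-without-benefit pattern described above.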
Semantic Drift in Approximate Matching
Semantic caching fails catastrophically on prompts that embed similarly but require different answers. Consider two questions: "What's our refund window for electronics?" and "What's our refund window for groceries?" They embed nearly identically (cosine similarity 0.94) because they share structure and most of their words. If the actual policies differ (30 days for electronics, 7 days for groceries), a semantic cache with a 0.90 threshold will return the wrong cached answer for one of the two questions, depending on which was cached first. This kind of subtle semantic drift is almost impossible to detect automatically without expensive verification.

The solution is domain-specific filtering. Before returning a cached hit, extract key entities ("electronics" versus "groceries") and verify they match. This adds 20 to 50 milliseconds of latency plus implementation complexity, reducing the benefit of caching. For high-risk domains, the conservative choice is to disable semantic caching entirely and rely only on exact matches.
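One hypothetical shape for that entity filter, using a hard-coded category list as a stand-in for whatever entity extractor your domain requires:

```python
# Illustrative entity filter layered on top of a semantic cache: even when
# embedding similarity clears the threshold, reject the hit if key domain
# entities (product categories here) differ between the two prompts.
PRODUCT_CATEGORIES = {"electronics", "groceries", "clothing"}

def extract_entities(prompt: str) -> set:
    words = {w.strip("?.,!").lower() for w in prompt.split()}
    return words & PRODUCT_CATEGORIES

def accept_semantic_hit(query: str, cached_prompt: str, similarity: float,
                        threshold: float = 0.90) -> bool:
    if similarity < threshold:
        return False  # not close enough to be a semantic hit at all
    # Close enough by embedding, but only accept if the entities agree.
    return extract_entities(query) == extract_entities(cached_prompt)
```

In production the set intersection would be replaced by a real NER or keyword model, which is where the extra 20 to 50 milliseconds comes from.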
KV Cache Memory Exhaustion
KV caching increases GPU memory pressure because you store tensors for every active conversation. On a mid-sized model, a single conversation with a 2,000-token history can consume a few hundred megabytes of KV cache, so even a few dozen concurrent requests can demand 5 to 10 GB of GPU memory just for KV cache, leaving less room for model weights and batch processing. When memory fills up, providers must either evict KV entries (forcing recomputation later) or swap them to host memory. Swapping can increase p99 latency from 800 milliseconds to 5 seconds because loading KV tensors from host memory over PCIe is slow. This creates a latency cliff under load that is hard to predict and debug.

Paged KV techniques help by splitting the cache into fixed-size blocks that can be swapped granularly, but this adds scheduling complexity. The system must decide which blocks to keep in GPU memory based on usage patterns, essentially implementing an LRU (Least Recently Used) cache inside the GPU.
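The memory math is worth making explicit. A back-of-envelope calculator, assuming a hypothetical 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16); your model's actual config will differ:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Each layer stores a K tensor and a V tensor per token,
    # each of size n_kv_heads * head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 8, 128)   # fp16, GQA with 8 KV heads
per_conversation = per_token * 2000           # a 2,000-token history
```

Under these assumptions each token costs 128 KB of KV cache and a 2,000-token conversation costs roughly 250 MB, which is why concurrency, not model size alone, is what exhausts GPU memory.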
Plan Cache False Positives
Agentic plan caching breaks when keyword extraction is too coarse. If two tasks map to the same keyword but need different plans, the system executes the wrong workflow. For example, "analyze Q3 earnings growth" and "forecast Q4 earnings growth" might both extract the keyword "earnings growth" but require different plans (historical analysis versus forward prediction). The research uses exact keyword matching to avoid fuzzy false positives, but this lowers the hit rate to around 47 percent. There is no free lunch: tighter matching means more cache misses and more expensive planner invocations; looser matching means a higher hit rate but more execution failures.

Another edge case is cached plans encoding outdated assumptions. If a plan from six months ago calls an API endpoint that has since changed, or uses tools that have been deprecated, a high cache hit rate actually causes a high task failure rate. You need explicit plan versioning and periodic revalidation or retraining of cached templates.
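A hypothetical sketch combining both mitigations, exact keyword matching plus a version tag, so plans built against an old tool or API surface are treated as misses (`CURRENT_PLAN_VERSION` and the planner callback are illustrative names):

```python
# Assumed convention: bump CURRENT_PLAN_VERSION whenever tools or API
# endpoints change, so every previously cached plan becomes stale at once.
CURRENT_PLAN_VERSION = "2024-q3"

plan_cache = {}  # keyword -> (version, plan)

def get_plan(keyword: str, planner):
    entry = plan_cache.get(keyword)
    if entry is not None:
        version, plan = entry
        if version == CURRENT_PLAN_VERSION:
            return plan  # exact-match hit on a current-version plan
    # Miss, or a plan cached under an older version: re-invoke the
    # expensive planner and overwrite the stale entry.
    plan = planner(keyword)
    plan_cache[keyword] = (CURRENT_PLAN_VERSION, plan)
    return plan
```

Exact matching on the keyword string is what keeps false positives out; the version check is what keeps six-month-old plans from executing against a changed tool surface.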
Quality Degradation Under Routing
Aggressive model routing can harm user experience in subtle ways that average metrics miss. Imagine routing 80 percent of coding questions to a cheaper model that performs well on simple syntax but poorly on algorithms. Your overall accuracy might be 94 percent (high), but the 20 percent of users asking hard questions see 75 percent accuracy (terrible). This creates a bimodal quality distribution where some users have great experiences and others have awful experiences, even though the average looks acceptable. Detecting this requires per-cohort or per-query-type quality tracking, not just global metrics.

The mitigation is conservative routing policies with escape hatches. Start by routing only the most obviously simple queries (like greetings or simple lookups). Monitor quality closely. Add a feedback mechanism so users can flag bad responses, triggering a retry with the premium model. This caps the damage from misrouting.
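A toy sketch of that policy, assuming hypothetical `cheap_model` and `premium_model` callables; the pattern list is a deliberately crude stand-in for a real query classifier:

```python
# Conservative router: only queries matching obviously-simple patterns go
# to the cheap model, and a user flag both retries on the premium model
# and remembers the misroute so the same query escalates in the future.
SIMPLE_PATTERNS = ("hello", "hi", "thanks", "what time")

def is_obviously_simple(query: str) -> bool:
    q = query.lower()
    return any(q.startswith(p) for p in SIMPLE_PATTERNS)

flagged = set()  # queries a user has flagged as bad

def route(query: str, cheap_model, premium_model):
    if is_obviously_simple(query) and query not in flagged:
        return cheap_model(query)
    return premium_model(query)  # default to quality when in doubt

def flag_bad_response(query: str, premium_model):
    flagged.add(query)           # escape hatch: remember the misroute
    return premium_model(query)  # immediate retry on the premium model
```

The key design choice is the default direction: anything the classifier is unsure about falls through to the premium model, so misrouting costs money rather than quality.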