
Failure Modes and Edge Cases in LLM Caching

Staleness and Cache Invalidation: The most common production failure is serving stale cached responses after the underlying data changes. Imagine an HR chatbot caching policy answers with a 24-hour TTL. If the vacation policy changes midday from 15 to 20 days, cached responses will tell employees the old 15-day policy for up to 24 hours, causing confusion and potential compliance issues. The fix is to tie cache keys to content versions. Instead of caching by prompt alone, include a policy_version hash in the key. When policies update, the version changes and old cache entries become unreachable. This works, but it requires coordinating cache invalidation with your content deployment pipeline. For dynamic data like stock prices or inventory, aggressive TTLs (10 to 30 seconds) are necessary. But this tanks your hit rate because cache entries expire before they can be reused, leaving you with cache overhead (key generation, lookup latency) but minimal benefit.
❗ Remember: Cache invalidation is one of the two hard problems in computer science. For LLM caching, always version your content or use conservative TTLs shorter than your acceptable staleness window.
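A minimal sketch of the content-versioned cache keys described above, assuming a simple hash-based key scheme; the function name and version identifiers are illustrative, not from any particular library:

```python
import hashlib
import json

def make_cache_key(prompt: str, model: str, policy_version: str) -> str:
    """Build a cache key that becomes unreachable when the underlying content changes.

    policy_version can be any stable identifier of the source content, e.g. a git SHA
    or a hash of the policy documents, bumped by the content deployment pipeline.
    """
    payload = json.dumps(
        {"prompt": prompt, "model": model, "policy_version": policy_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The same prompt maps to different keys before and after a policy update,
# so stale answers are simply never looked up again.
key_before = make_cache_key("How many vacation days do I get?", "chat-model-v1", "policies@2024-06-01")
key_after = make_cache_key("How many vacation days do I get?", "chat-model-v1", "policies@2024-06-15")
assert key_before != key_after
```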
Semantic Drift in Approximate Matching: Semantic caching fails catastrophically on prompts that embed similarly but require different answers. Consider two questions: "What's our refund window for electronics?" and "What's our refund window for groceries?" They embed nearly identically (cosine similarity 0.94) because they share structure and most words. If the actual policies differ (30 days for electronics, 7 days for groceries), a semantic cache with a 0.90 threshold will return the wrong cached answer roughly half the time, depending on which question was cached first. This kind of subtle semantic drift is almost impossible to detect automatically without expensive verification. The solution is domain-specific filtering: before returning a cached hit, extract key entities ("electronics" versus "groceries") and verify they match, as sketched below. This adds 20 to 50 milliseconds of latency plus implementation complexity, reducing the benefit of caching. For high-risk domains, the conservative choice is to disable semantic caching entirely and rely only on exact matches.
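A minimal sketch of that entity check, assuming a hard-coded category vocabulary for illustration (in a real system the entities would come from your product catalog or policy taxonomy):

```python
import re

# Hypothetical domain vocabulary; real systems would load this from a catalog or taxonomy.
KNOWN_CATEGORIES = {"electronics", "groceries", "apparel", "furniture"}

def extract_entities(text: str) -> set:
    """Pull known category terms out of a prompt (naive keyword matching)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & KNOWN_CATEGORIES

def accept_semantic_hit(query: str, cached_query: str, similarity: float,
                        threshold: float = 0.90) -> bool:
    """Return the cached answer only if similarity clears the threshold AND
    the key entities in both prompts agree."""
    return similarity >= threshold and extract_entities(query) == extract_entities(cached_query)

# Electronics vs. groceries embed at ~0.94 similarity but must not share a cache entry:
assert not accept_semantic_hit(
    "What's our refund window for groceries?",
    "What's our refund window for electronics?",
    similarity=0.94,
)
```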
KV Cache Memory Exhaustion: KV caching increases GPU memory pressure because you store key/value tensors for every active conversation. In a chat system serving 10,000 concurrent users, each with a 2,000-token history, you might need 5 to 10 gigabytes (GB) of GPU memory just for KV cache, leaving less room for model weights and batch processing. When memory fills up, providers must either evict KV entries (forcing recomputation later) or swap them to host memory. Swapping can increase p99 latency from 800 milliseconds to 5 seconds because loading KV tensors from host memory over PCIe is slow. This creates a latency cliff under load that is hard to predict and debug. Paged KV techniques help by splitting the cache into fixed-size blocks that can be swapped granularly, but this adds scheduling complexity: the system must decide which blocks to keep in GPU memory based on usage patterns, essentially implementing an LRU (Least Recently Used) cache inside the GPU.
KV Cache Memory Pressure: NORMAL ≈ 800 ms p99 → SWAP TO HOST ≈ 5,000 ms p99
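To see why the memory fills so quickly, here is a back-of-the-envelope sketch; the model dimensions (32 layers, 8 KV heads of dimension 128, fp16) and the 8 GiB KV budget are assumptions chosen for illustration:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model: 32 layers, 8 KV heads of dim 128, fp16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)  # 128 KiB per token
per_conversation = per_token * 2_000      # 2,000-token history -> 250 MiB per conversation
kv_budget = 8 * 1024**3                   # assume 8 GiB of GPU memory reserved for KV cache

print(f"{per_conversation / 1024**2:.0f} MiB per conversation; "
      f"only {kv_budget // per_conversation} conversations fit before eviction or swapping begins")
```

Every conversation beyond that budget must either be recomputed from scratch on its next turn or paged between host and GPU memory, which is exactly where the latency cliff comes from.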
Plan Cache False Positives: Agentic plan caching breaks when keyword extraction is too coarse. If two tasks map to the same keyword but need different plans, the system executes the wrong workflow. For example, "analyze Q3 earnings growth" and "forecast Q4 earnings growth" might both extract the keyword "earnings growth" yet require different plans (historical analysis versus forward prediction). The research uses exact keyword matching to avoid fuzzy false positives, but this lowers the hit rate to around 47 percent. There is no free lunch: tighter matching means more cache misses and more expensive planner invocations, while looser matching means a higher hit rate but more execution failures. Another edge case is cached plans encoding outdated assumptions. If a plan cached six months ago calls an API endpoint that has since changed, or uses tools that have been deprecated, a high cache hit rate actually produces a high task failure rate. You need explicit plan versioning and periodic revalidation or retraining of cached templates.
Quality Degradation Under Routing: Aggressive model routing can harm user experience in subtle ways that average metrics miss. Imagine routing 80 percent of coding questions to a cheaper model that performs well on simple syntax but poorly on algorithms. Your overall accuracy might be 94 percent (high), but the 20 percent of users asking hard questions see 75 percent accuracy (terrible). This creates a bimodal quality distribution where some users have great experiences and others have awful ones, even though the average looks acceptable. Detecting this requires per-cohort or per-query-type quality tracking, not just global metrics. The mitigation is conservative routing policies with escape hatches: start by routing only the most obviously simple queries (like greetings or simple lookups), monitor quality closely, and add a feedback mechanism so users can flag bad responses, triggering a retry with the premium model. This caps the damage from misrouting; a minimal sketch of such a policy follows below.
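A sketch of a conservative router with a feedback escape hatch; the model names, routing heuristics, and feedback hook are hypothetical placeholders rather than a specific vendor API:

```python
from typing import Callable

# Hypothetical model tiers and routing heuristics, for illustration only.
CHEAP_MODEL, PREMIUM_MODEL = "small-model", "large-model"
SIMPLE_PREFIXES = ("hello", "hi", "thanks", "what is", "define")

def route(query: str) -> str:
    """Conservative policy: only obviously simple queries go to the cheap model;
    everything else defaults to the premium model."""
    q = query.lower().strip()
    is_simple = len(q.split()) <= 12 and q.startswith(SIMPLE_PREFIXES)
    return CHEAP_MODEL if is_simple else PREMIUM_MODEL

def answer_with_escape_hatch(query: str,
                             call_model: Callable[[str, str], str],
                             user_flagged_bad: Callable[[str], bool]) -> str:
    """Retry on the premium model when the user flags a routed response as bad."""
    model = route(query)
    response = call_model(model, query)
    if model != PREMIUM_MODEL and user_flagged_bad(response):
        response = call_model(PREMIUM_MODEL, query)
    return response

# Toy usage: a fake model call and a feedback hook that always rejects cheap answers.
fake_call = lambda model, q: f"[{model}] answer to: {q}"
print(answer_with_escape_hatch("what is a mutex", fake_call, lambda r: "small-model" in r))
```

The key design choice is that anything not obviously simple defaults to the premium model, so misrouting errs toward higher cost rather than lower quality.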
💡 Key Takeaways
Staleness failures occur when the cache TTL exceeds the content update interval, requiring content versioning in cache keys or aggressive TTLs that hurt hit rates
Semantic caching with a 0.90 similarity threshold can return wrong answers for prompts that embed similarly (0.94 cosine similarity) but need different responses, requiring entity-level validation
KV cache memory pressure causes latency cliffs: p99 jumps from 800 milliseconds to 5 seconds when GPU memory fills and system swaps tensors to host memory
Agentic plan cache false positives happen when distinct tasks map to the same keyword, and stale cached plans can encode outdated API assumptions causing execution failures
Model routing creates bimodal quality distribution where average metrics look good but specific user cohorts experience significantly degraded performance
📌 Examples
1. HR chatbot serves the 15-day vacation policy from cache for 18 hours after the policy was updated to 20 days, because the TTL was 24 hours and the key did not include a policy version
2. E-commerce refund bot with a semantic cache (0.90 threshold) returns the electronics policy (30 days) for a grocery question because the queries embedded at 0.94 similarity despite having different answers
3. Chat system at 10K concurrent users fills 8 GB of GPU memory with KV cache, forcing swaps to host memory that spike p99 from 600 ms to 4 seconds during peak traffic
4. Coding assistant routes 80% of queries to a cheap model; overall accuracy is 93%, but algorithmic questions (20% of traffic) drop to 72% accuracy, hurting expert users