
Three Layers of LLM Caching

Understanding the Caching Stack: Production LLM systems use multiple caching layers working together, each optimizing a different aspect of the inference pipeline. Knowing where and how each layer operates is crucial for interview discussions about system design.
1. Application Layer Response Caching: Stores complete prompt and response pairs. When an identical or semantically similar request arrives, the system returns the cached response without calling the model at all. This is the most visible and controllable layer for application developers; a minimal sketch follows this list.
2. Provider Level Key Value (KV) Cache: Stores attention key and value tensors for tokens that have already been processed. When generating the next token in a conversation, the model reuses these cached tensors instead of recomputing attention over the entire history. This is mostly transparent to developers but crucial for throughput.
3. Agentic Plan Caching: Caches high level reasoning structures or workflows for complex multi-step tasks. Instead of caching raw text, this layer stores abstracted plans that can be adapted to new contexts, reducing expensive planner model invocations.
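To make layer 1 concrete, here is a minimal sketch of an exact-match response cache. It assumes an in-process dict and a hypothetical `call_llm` stand-in for whatever provider SDK you use; a production version would more likely live in Redis, add a TTL, and layer in a semantic (embedding-similarity) lookup for near-duplicate prompts.

```python
import hashlib

_response_cache: dict[str, str] = {}

def call_llm(prompt: str, model: str) -> str:
    """Hypothetical stand-in for a real provider SDK call."""
    return f"model response for: {prompt}"

def cache_key(prompt: str, model: str) -> str:
    # Normalize whitespace and casing so trivially different phrasings still hit.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cached_completion(prompt: str, model: str = "gpt-4o") -> str:
    key = cache_key(prompt, model)
    if key in _response_cache:
        return _response_cache[key]      # hit: no model call, typically single-digit ms
    response = call_llm(prompt, model)   # miss: pay full model latency and cost
    _response_cache[key] = response
    return response
```

Keying on the model name matters: the same prompt sent to different models should not share a cached answer.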
How KV Caching Works: For interactive chat, KV caching provides massive savings. Imagine a 10 turn conversation where each message adds 50 tokens. Without KV caching, generating token 501 (the first token of turn 11) would require computing attention over all 500 previous tokens. With KV caching, providers like OpenAI and Anthropic keep those attention tensors in GPU memory, so only the new user message needs full forward passes. This cuts per token compute by 30 to 70 percent depending on conversation length; a chat session with 2,000 tokens of history might see 60 percent of compute eliminated through KV reuse. Paged memory management pushes this further, keeping more concurrent users per GPU by swapping rarely accessed keys to host memory when GPU memory gets tight. The first code sketch below walks through the mechanism.

Agentic Plan Caching Example: Consider a financial analysis agent that must generate earnings reports. A large planner model (2 to 3 seconds p95 latency) creates a multi-step plan: fetch earnings data, calculate growth metrics, compare to sector average, generate summary. A smaller actor model (200 to 400 milliseconds p95) executes each step. Without plan caching, every new ticker requires calling the expensive planner. With plan caching, the system extracts the task intent ("quarterly earnings analysis"), looks up a cached generic plan, and adapts it with a cheaper model by filling in the specific ticker and date. Research shows this reduces planner invocations by roughly 47 percent while maintaining about 97 percent of baseline accuracy. The second sketch below outlines the lookup-and-adapt flow.
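The toy decode loop below illustrates the KV-cache mechanism with a single NumPy attention head and random weights. It is purely didactic (no real model), but it shows the essential property: each step computes projections only for the new token, while the keys and values of all earlier tokens are reused from the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # toy head dimension
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

K_cache: list[np.ndarray] = []            # one cached key per already-processed token
V_cache: list[np.ndarray] = []            # one cached value per already-processed token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attend the new token over the full history without recomputing old K/V."""
    q = x_new @ W_q
    K_cache.append(x_new @ W_k)           # only the NEW token's key/value are computed
    V_cache.append(x_new @ W_v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = (q @ K.T) / np.sqrt(d)       # attention over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # context vector for the new token

# Simulate 500 tokens of history, then the first token of turn 11: that last call
# performs one projection per weight matrix instead of 501.
for _ in range(501):
    out = decode_step(rng.standard_normal(d))
```

And here is a hedged sketch of the plan-cache lookup-and-adapt flow. `call_planner`, `call_adapter`, and `extract_intent` are illustrative stubs standing in for the large planner model, the cheap adapter model, and an intent classifier; they are not the API of any particular framework.

```python
# Illustrative stubs -- in a real system these would be model calls, not literals.
def call_planner(task: str) -> list[str]:
    """Large planner model (~2-3 s p95, simulated): returns an abstract plan."""
    return ["fetch earnings data for {ticker}",
            "calculate growth metrics for {ticker}",
            "compare to sector average",
            "generate summary"]

def call_adapter(plan: list[str], ctx: dict) -> list[str]:
    """Small adapter model (~100-400 ms p95, simulated): fills in specifics."""
    return [step.format(**ctx) if "{" in step else step for step in plan]

def extract_intent(task: str) -> str:
    """In practice a cheap classifier or embedding lookup; keyword match here."""
    return "quarterly earnings analysis" if "earnings" in task.lower() else task

plan_cache: dict[str, list[str]] = {}     # task intent -> abstract (unfilled) plan

def get_plan(task: str, ctx: dict) -> list[str]:
    intent = extract_intent(task)
    if intent in plan_cache:                        # hit: skip the expensive planner
        return call_adapter(plan_cache[intent], ctx)
    plan = call_planner(task)                       # miss: pay the planner latency once
    plan_cache[intent] = plan                       # cache the abstract plan, not the output
    return call_adapter(plan, ctx)

get_plan("Earnings report for NVDA Q3", {"ticker": "NVDA"})   # planner call
get_plan("Earnings report for AAPL Q3", {"ticker": "AAPL"})   # cache hit, adapter only
```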
"The key insight: different caching layers optimize different bottlenecks. Response caching eliminates redundant LLM calls entirely. KV caching reduces compute per token. Plan caching cuts down expensive reasoning steps."
Working Together: In production, all three layers often operate simultaneously. A chat request first checks the response cache for an exact match. If not found, it proceeds to the model, which uses KV caching to efficiently process the conversation history. For complex agentic workflows, plan caching determines whether to invoke the expensive planner or reuse an adapted template.
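Putting the layers together, a request handler might check them in the order described above. This is a hedged sketch rather than a real serving stack: `response_cache` and `run_model_with_kv_cache` are illustrative stand-ins, and `get_plan` refers to the plan-cache sketch earlier in this section.

```python
response_cache: dict[str, str] = {}       # layer 1: exact-match application cache

def run_model_with_kv_cache(conversation_id: str, prompt: str, plan=None) -> str:
    """Stand-in for the provider call. The serving stack keeps this conversation's
    attention K/V tensors warm, so only new tokens need full forward passes."""
    return f"model output for: {prompt}"

def handle_request(prompt: str, conversation_id: str,
                   is_agentic: bool = False, context: dict | None = None) -> str:
    # Layer 1: a response-cache hit returns without touching the model at all.
    if prompt in response_cache:
        return response_cache[prompt]

    # Layer 3 (agentic workflows only): reuse an adapted plan instead of re-planning.
    # get_plan is the function from the plan-cache sketch above.
    plan = get_plan(prompt, context or {}) if is_agentic else None

    # Layer 2: generation itself is accelerated by provider-side KV caching.
    response = run_model_with_kv_cache(conversation_id, prompt, plan)

    response_cache[prompt] = response
    return response
```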
💡 Key Takeaways
Response caching at the application layer eliminates entire LLM calls for identical queries, with typical hit rates of 20 to 40 percent in enterprise scenarios
KV caching at the provider level stores attention tensors to avoid recomputing them for conversation history, cutting per token compute by 30 to 70 percent
Agentic plan caching stores abstracted reasoning workflows that can be adapted to new tasks, reducing expensive planner invocations by roughly 47 percent
These layers stack: a single request might benefit from all three, with each layer optimizing a different part of the inference pipeline
📌 Examples
1. Chat application: Response cache serves FAQ answers in 5 ms. For novel questions, the KV cache reuses attention over a 2,000 token history, saving 60% of compute per new token
2. Financial agent: Plan cache provides a generic earnings analysis workflow in 100 ms via a small adapter model, avoiding a 2 second planner call for structurally similar tasks
3. Customer support: Application cache stores exact responses for common issues. KV cache speeds up multi-turn troubleshooting. No plan cache needed for the simple query/response pattern