Three Layers of LLM Caching
Understanding the Caching Stack
Production LLM systems use multiple caching layers working together, each optimizing different aspects of the inference pipeline. Understanding where and how each layer operates is crucial for interview discussions about system design.
How KV Caching Works
For interactive chat, KV caching provides massive savings. Imagine a 10-turn conversation where each message adds 50 tokens. Without KV caching, generating token 501 (the first token of turn 11) would require recomputing attention projections over all 500 previous tokens. With KV caching, providers like OpenAI and Anthropic keep the key/value tensors for those tokens in GPU memory, so only the new user message needs full forward passes. This cuts per-token compute by 30 to 70 percent depending on conversation length; a chat session with 2,000 tokens of history might see 60 percent of compute eliminated through KV reuse. Serving engines pair this with paged memory management, which keeps more concurrent users per GPU by swapping rarely accessed key/value blocks to host memory when GPU memory gets tight.
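The saving comes from the fact that keys and values for past tokens never change, so they can be stored once and reused. A minimal single-head attention sketch (illustrative NumPy, not any provider's actual implementation; the head dimension and random tensors are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension, chosen for illustration

class KVCache:
    """Accumulates key/value rows so past tokens are never re-projected."""
    def __init__(self):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q, cache):
    # Attention over all cached positions; the only new per-step work
    # is projecting the latest token and these dot products.
    scores = cache.keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

cache = KVCache()
for _ in range(500):                 # 500 tokens of conversation history
    cache.append(rng.normal(size=d), rng.normal(size=d))

q = rng.normal(size=d)               # query for token 501
out = attend(q, cache)               # reuses all 500 cached K/V rows
print(out.shape)                     # (64,)
```

Without the cache, every step would redo the key/value projections for the whole history; with it, history cost is reduced to the attention dot products themselves.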
Agentic Plan Caching Example
Consider a financial analysis agent that must generate earnings reports. A large planner model (2 to 3 seconds p95 latency) creates a multi-step plan: fetch earnings data, calculate growth metrics, compare to sector average, generate summary. A smaller actor model (200 to 400 milliseconds p95) executes each step. Without plan caching, every new ticker requires calling the expensive planner. With plan caching, the system extracts task intent ("quarterly earnings analysis"), looks up a cached generic plan, and adapts it with a cheaper model by filling in the specific ticker and date. Research shows this reduces planner invocations by roughly 47 percent while maintaining about 97 percent of baseline accuracy.
Working Together
In production, all three layers often operate simultaneously. A chat request first checks the response cache for an exact match. If not found, it proceeds to the model, which uses KV caching to efficiently process the conversation history. For complex agentic workflows, plan caching determines whether to invoke the expensive planner or reuse an adapted template.
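The lookup order described above can be summarized in a short sketch. Function names and the dict-based response cache are illustrative assumptions; in production the response cache would be a shared store like Redis and KV caching lives inside the serving engine, not application code:

```python
# Layered flow: response cache first, then (for agents) plan cache,
# then the model itself, whose serving engine handles KV caching.
response_cache = {}

def handle_request(prompt: str, is_agentic: bool, generate, plan) -> str:
    # Layer 1: exact-match response cache.
    if prompt in response_cache:
        return response_cache[prompt]
    # Layer 3 (agentic workflows only): plan cache decides whether to
    # invoke the planner or reuse an adapted template.
    steps = plan(prompt) if is_agentic else None
    # Layer 2: model call; KV caching is applied transparently here.
    answer = generate(prompt, steps)
    response_cache[prompt] = answer
    return answer

calls = []
gen = lambda p, steps: calls.append(p) or f"answer to {p}"
first = handle_request("summarize AAPL Q3", False, gen, plan=lambda p: [])
second = handle_request("summarize AAPL Q3", False, gen, plan=lambda p: [])
print(len(calls))   # 1: the repeat request never reaches the model
```

Each layer short-circuits the ones below it, so the expensive resources (planner calls, GPU forward passes) are only spent on genuinely new work.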