What is KV Cache in LLM Serving?
Large Language Models (LLMs) generate text autoregressively, producing one token at a time. Without optimization, each new token would require recomputing attention over all previous tokens, making every decode step quadratically expensive in sequence length. The key-value (KV) cache solves this by storing the attention keys and values of already-processed tokens in GPU memory, so each new token only computes attention against the cached state instead of reprocessing the entire history.
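A minimal sketch of a single decode step with a KV cache, for one attention head in NumPy. The decode_step helper and the shapes are illustrative assumptions, not any serving framework's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """Attend the new token's query against all cached keys/values."""
    # Append this token's K/V instead of recomputing the whole history.
    k_cache = np.vstack([k_cache, k_new])            # (seq_len + 1, head_dim)
    v_cache = np.vstack([v_cache, v_new])
    # One query against the cache: linear work per new token.
    scores = k_cache @ q_new / np.sqrt(q_new.shape[-1])
    out = softmax(scores) @ v_cache                  # (head_dim,)
    return out, k_cache, v_cache

# The cache grows by one row of K and one row of V per generated token.
head_dim = 128
k_cache = np.zeros((0, head_dim))
v_cache = np.zeros((0, head_dim))
for _ in range(4):  # stand-in for a real decode loop
    q, k, v = (np.random.randn(head_dim) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```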
The memory cost is substantial. For decoder-only models, each token requires approximately 2 × num_layers × num_kv_heads × head_dim × precision_bytes of memory, where the factor of 2 covers keys and values. In half precision (FP16), Llama 2 7B uses about 0.5 MB per token, and larger models like BLOOM 176B consume roughly 4 MB per token. This memory grows with both batch size and sequence length, making cache management the central challenge in LLM serving.
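A back-of-the-envelope check of that formula, plugging in the published layer and head counts for Llama 2 7B and BLOOM 176B (the helper function itself is just for illustration):

```python
# Per-token KV-cache size: 2 (K and V) x layers x KV heads x head_dim x bytes.
def kv_bytes_per_token(layers, kv_heads, head_dim, precision_bytes=2):
    return 2 * layers * kv_heads * head_dim * precision_bytes

# Published configs: Llama 2 7B has 32 layers, 32 heads, head_dim 128;
# BLOOM 176B has 70 layers, 112 heads, head_dim 128.
llama2_7b  = kv_bytes_per_token(32, 32, 128)    # 524,288 bytes  ~= 0.5 MB
bloom_176b = kv_bytes_per_token(70, 112, 128)   # 4,014,080 bytes ~= 4 MB

print(f"Llama 2 7B : {llama2_7b / 2**20:.2f} MiB per token")
print(f"BLOOM 176B : {bloom_176b / 2**20:.2f} MiB per token")
```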
The tradeoff is memory versus recomputation. Without KV caching, each decode step performs quadratic computation in sequence length, which is prohibitively expensive. With caching, each step does only linear work for the new token but consumes large amounts of GPU memory. On an 80 GB GPU serving a 7B model, the weights use around 14 GB, leaving about 50 GB for KV cache. At 0.5 MB per token, this supports roughly 100,000 cached tokens in total. A batch of 64 sequences with 800 tokens each already consumes 25.6 GB, and growing each sequence by another 200 tokens adds 6.4 GB.
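The same budget arithmetic, sketched in a few lines; the 50 GB budget and 0.5 MB/token figures simply restate the numbers above:

```python
MB, GB = 10**6, 10**9            # decimal units, matching the figures above

cache_budget = 50 * GB           # 80 GB card minus ~14 GB weights and headroom
per_token    = 0.5 * MB          # Llama 2 7B in FP16

print(f"max cached tokens : {cache_budget / per_token:,.0f}")          # 100,000
print(f"64 x 800 tokens   : {64 * 800 * per_token / GB:.1f} GB")       # 25.6 GB
print(f"+200 tokens each  : {64 * 200 * per_token / GB:.1f} GB more")  # 6.4 GB
```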
Production systems must explicitly partition GPU memory among model weights, KV cache, and activations. Companies like Meta use grouped-query attention (GQA) in Llama 2 70B to reduce the number of KV heads, cutting cache size proportionally; Google applied similar techniques in the PaLM family. The failure mode to watch for is running out of memory mid-generation when sequences grow longer than expected, which causes user-visible errors and forces requests to be aborted.
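A rough sketch of why grouped-query attention shrinks the cache: keys and values are stored for only a small number of KV heads, and each KV head is shared by a group of query heads. The 64/8 head counts mirror Llama 2 70B; the code itself is illustrative, not a real implementation:

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 16
group = n_q_heads // n_kv_heads          # 8 query heads share each KV head

# The cache holds only the 8 KV heads -> 8x smaller than full multi-head attention.
k_cache = np.random.randn(n_kv_heads, seq_len, head_dim)
v_cache = np.random.randn(n_kv_heads, seq_len, head_dim)

q = np.random.randn(n_q_heads, head_dim)     # one new token, all query heads
out = np.empty((n_q_heads, head_dim))
for h in range(n_q_heads):
    kv = h // group                                   # shared KV head for this query head
    scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)   # (seq_len,)
    w = np.exp(scores - scores.max()); w /= w.sum()
    out[h] = w @ v_cache[kv]

mha = n_q_heads  * seq_len * head_dim * 2 * 2    # K and V, FP16, one layer
gqa = n_kv_heads * seq_len * head_dim * 2 * 2
print(f"KV cache reduction: {mha // gqa}x")      # 8x
```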
💡 Key Takeaways
•KV cache converts attention computation from quadratic in total sequence length to linear per new token by storing previously computed keys and values
•Memory cost per token: 2 × layers × KV heads × head_dim × precision_bytes. In FP16, Llama 2 7B uses about 0.5 MB per token and BLOOM 176B roughly 4 MB per token
•On an 80 GB GPU, a 7B model with 14 GB weights leaves about 50 GB for KV cache, supporting roughly 100,000 token entries total across all sequences
•Grouped-query attention reduces KV memory by sharing key/value heads across multiple query heads, cutting cache size proportionally, as in Llama 2 70B; PaLM uses the related multi-query approach
•Primary failure mode is running out of memory during generation when sequences grow beyond estimates, causing request aborts and user-visible errors (a minimal admission-check sketch follows these takeaways)
•The fundamental tradeoff is memory consumption versus recomputation cost; without caching, generation is prohibitively expensive but with it, memory becomes the bottleneck
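One simple way to guard against that out-of-memory failure mode is to reserve worst-case cache space before admitting a request. The sketch below is an illustrative policy with made-up names, not any particular serving framework's scheduler:

```python
MB = 10**6

class KVCacheBudget:
    """Track free KV-cache bytes and admit requests only if their worst case fits."""

    def __init__(self, budget_bytes, bytes_per_token):
        self.free = budget_bytes
        self.per_token = bytes_per_token

    def try_admit(self, prompt_tokens, max_new_tokens):
        # Reserve the worst-case footprint up front; return the reservation or None.
        worst_case = (prompt_tokens + max_new_tokens) * self.per_token
        if worst_case > self.free:
            return None               # queue or reject instead of aborting mid-generation
        self.free -= worst_case
        return worst_case

    def release(self, reservation):
        # Give the bytes back when the request finishes or is cancelled.
        self.free += reservation

# Usage with the figures from the text: 50 GB budget, 0.5 MB per token.
budget = KVCacheBudget(budget_bytes=50_000 * MB, bytes_per_token=0.5 * MB)
reservation = budget.try_admit(prompt_tokens=800, max_new_tokens=200)
print("admitted" if reservation else "queued")
```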
📌 Examples
Llama 2 7B in FP16: Each token stores approximately 0.5 MB of KV data. A batch of 64 sequences at 800 tokens each consumes 64 × 800 × 0.5 MB = 25.6 GB of cache
BLOOM 176B: At roughly 4 MB per token, just 12,500 tokens would fill 50 GB of cache memory, severely limiting batch size and sequence length
Meta's Llama 2 70B uses grouped-query attention with 8 KV heads for 64 query heads, reducing KV cache size by 8x compared to multi-head attention
Typical GPU memory allocation: 14 GB for 7B model weights, 50 GB for KV cache (roughly 60% of the 80 GB card), with the remainder for activations and overhead