What is KV Cache in LLM Serving?
WHY KV CACHE MATTERS
LLMs generate text one token at a time. At each step, the attention mechanism needs to attend to all previous tokens. Without caching, generating token N requires re-running attention over all N tokens from scratch, an O(N²) cost at that single step; summed over a whole sequence, generation costs O(N³).
With a KV cache, the keys and values for tokens 1 through N-1 are stored from previous steps. Generating token N only requires computing the query, key, and value for the new token, then attending against the cached keys and values. This reduces the per-token attention cost from O(N²) to O(N), making generation dramatically faster.
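The asymptotic gap can be made concrete by counting attention score computations (query-key dot products). This is a toy tally, not a real model; the sequence length T is an arbitrary choice:

```python
# Rough count of query-key dot products needed to generate T tokens,
# illustrating the asymptotics described above.
T = 1024

# Without cache: step t re-runs attention over all t tokens -> t*t score pairs.
naive = sum(t * t for t in range(1, T + 1))   # O(T^3) total

# With cache: step t computes one new query against t cached keys -> t pairs.
cached = sum(t for t in range(1, T + 1))      # O(T^2) total

print(naive // cached)                        # -> 683, roughly 2T/3
```

The ratio grows linearly with sequence length, which is why caching matters more and more at long contexts.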
HOW IT WORKS
During the first forward pass (prompt processing), compute K and V matrices for all prompt tokens and store them. For each subsequent token generated:
1. Compute Q, K, V for just the new token
2. Append new K, V to the cache
3. Compute attention using new Q against all cached K, V
4. Generate next token prediction
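The four steps above can be sketched with a single toy attention head in NumPy. The dimensions, random projection matrices, and prompt embeddings are all made-up placeholders; a real model has many layers and heads plus output projections and an MLP:

```python
import numpy as np

d = 64                                       # head dimension (toy size, assumption)
rng = np.random.default_rng(0)

# Hypothetical fixed projection matrices for one attention head.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query against all cached keys/values."""
    scores = K @ q / np.sqrt(d)              # (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # (d,)

# Prompt processing: compute and cache K, V for all prompt tokens at once.
prompt = rng.standard_normal((5, d))         # 5 toy prompt-token embeddings
K_cache = prompt @ Wk                        # (5, d)
V_cache = prompt @ Wv

# Decode: each new token computes only its own Q, K, V and appends to the cache.
x = rng.standard_normal(d)                   # embedding of the newest token
for _ in range(3):
    q, k, v = x @ Wq, x @ Wk, x @ Wv         # step 1: Q, K, V for the new token only
    K_cache = np.vstack([K_cache, k])        # step 2: append new K, V to the cache
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)          # step 3: new Q against all cached K, V
                                             # step 4: a real model would now project
                                             # to vocabulary logits and sample a token

print(K_cache.shape)                         # cache grew from (5, 64) to (8, 64)
```

Note that the cached K and V rows are never recomputed; each decode step touches them read-only and appends exactly one new row to each.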
The cache grows linearly with sequence length. For a 70B-parameter model with an 8K context, the KV cache alone can consume 16-32GB of GPU memory, depending on precision and attention variant.
MEMORY IMPLICATIONS
KV cache size per token: 2 × num_layers × hidden_dim × bytes_per_value. For Llama-70B (80 layers, 8192 hidden dim, FP16): ~2.6MB per token. An 8K context sequence requires ~20GB just for KV cache. (This formula assumes standard multi-head attention; models using grouped-query attention keep fewer KV heads and store proportionally less.)
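The arithmetic from the formula above, using the Llama-70B shapes given in the text:

```python
# KV-cache size per token = 2 (K and V) * num_layers * hidden_dim * bytes_per_value.
layers, hidden, bytes_fp16 = 80, 8192, 2      # Llama-70B-style shapes from the text

per_token = 2 * layers * hidden * bytes_fp16  # 2,621,440 bytes ~= 2.6 MB
print(per_token / 2**20)                      # -> 2.5 (MiB per token)

context = 8192                                # 8K context
total = per_token * context
print(total / 2**30)                          # -> 20.0 (GiB for the full sequence)
```

The ~2.6MB-per-token and ~20GB figures in the text fall straight out of this multiplication.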
This memory pressure is why context length and batch size trade off directly: more concurrent requests mean less context per request, or more GPUs.
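The tradeoff can be made concrete with back-of-the-envelope numbers. The 40GiB KV-cache budget below is a hypothetical figure (whatever is left on an 80GB GPU after weights and activations), not from the text:

```python
# Hypothetical budget: how many concurrent requests fit in the KV-cache share
# of GPU memory at a given context length? Uses the per-token size derived above.
per_token_bytes = 2 * 80 * 8192 * 2           # ~2.6 MB/token (Llama-70B shapes, FP16)
budget = 40 * 2**30                           # assume 40 GiB left for KV cache

for context in (2048, 4096, 8192):
    per_request = per_token_bytes * context   # worst-case cache for one request
    print(context, budget // per_request)     # longer context -> fewer requests
# -> 2048 8
# -> 4096 4
# -> 8192 2
```

Doubling the context halves the number of requests that fit, which is exactly the direct tradeoff described above.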