What is KV Cache in LLM Serving?
Large Language Models (LLMs) generate text autoregressively, producing one token at a time. Without optimization, each new token would require recomputing attention over all previous tokens, making every decode step quadratically expensive in sequence length. The key-value (KV) cache solves this by storing the attention keys and values of already-processed tokens in GPU memory, so each new token only computes attention against the cached state instead of reprocessing the entire history.
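A minimal sketch of a single decode step with a KV cache, for one attention head in NumPy. The decode_step helper and the shapes are illustrative assumptions, not any serving framework's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """Attend the new token's query against all cached keys/values."""
    # Append this token's K/V instead of recomputing the whole history.
    k_cache = np.vstack([k_cache, k_new])            # (seq_len + 1, head_dim)
    v_cache = np.vstack([v_cache, v_new])
    # One query against the cache: linear work per new token.
    scores = k_cache @ q_new / np.sqrt(q_new.shape[-1])
    out = softmax(scores) @ v_cache                  # (head_dim,)
    return out, k_cache, v_cache

# The cache grows by one row of K and one row of V per generated token.
head_dim = 128
k_cache = np.zeros((0, head_dim))
v_cache = np.zeros((0, head_dim))
for _ in range(4):  # stand-in for a real decode loop
    q, k, v = (np.random.randn(head_dim) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```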
The memory cost is substantial. For decoder-only models, each token requires approximately 2 × num_layers × num_kv_heads × head_dim × precision_bytes of memory, where the factor of 2 covers keys and values. In half precision (FP16), Llama 2 7B uses about 0.5 MB per token, and larger models like BLOOM 176B consume roughly 4 MB per token. This memory grows with both batch size and sequence length, making cache management the central challenge in LLM serving.
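A back-of-the-envelope check of that formula, plugging in the published layer and head counts for Llama 2 7B and BLOOM 176B (the helper function itself is just for illustration):

```python
# Per-token KV-cache size: 2 (K and V) x layers x KV heads x head_dim x bytes.
def kv_bytes_per_token(layers, kv_heads, head_dim, precision_bytes=2):
    return 2 * layers * kv_heads * head_dim * precision_bytes

# Published configs: Llama 2 7B has 32 layers, 32 heads, head_dim 128;
# BLOOM 176B has 70 layers, 112 heads, head_dim 128.
llama2_7b  = kv_bytes_per_token(32, 32, 128)    # 524,288 bytes  ~= 0.5 MB
bloom_176b = kv_bytes_per_token(70, 112, 128)   # 4,014,080 bytes ~= 4 MB

print(f"Llama 2 7B : {llama2_7b / 2**20:.2f} MiB per token")
print(f"BLOOM 176B : {bloom_176b / 2**20:.2f} MiB per token")
```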
The tradeoff is memory versus recomputation. Without KV caching, each decode step performs quadratic computation in sequence length, which is prohibitively expensive. With caching, each step does only linear work for the new token but consumes large amounts of GPU memory. On an 80 GB GPU serving a 7B model, the weights use around 14 GB, leaving about 50 GB for KV cache. At 0.5 MB per token, this supports roughly 100,000 cached tokens in total. A batch of 64 sequences with 800 tokens each already consumes 25.6 GB, and growing each sequence by another 200 tokens adds 6.4 GB.
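The same budget arithmetic, sketched in a few lines; the 50 GB budget and 0.5 MB/token figures simply restate the numbers above:

```python
MB, GB = 10**6, 10**9            # decimal units, matching the figures above

cache_budget = 50 * GB           # 80 GB card minus ~14 GB weights and headroom
per_token    = 0.5 * MB          # Llama 2 7B in FP16

print(f"max cached tokens : {cache_budget / per_token:,.0f}")          # 100,000
print(f"64 x 800 tokens   : {64 * 800 * per_token / GB:.1f} GB")       # 25.6 GB
print(f"+200 tokens each  : {64 * 200 * per_token / GB:.1f} GB more")  # 6.4 GB
```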
Production systems must explicitly partition GPU memory among model weights, KV cache, and activations. Companies like Meta use grouped-query attention (GQA) in Llama 2 70B to reduce the number of KV heads, cutting cache size proportionally; Google applied similar techniques in the PaLM family. The failure mode to watch for is running out of memory mid-generation when sequences grow longer than expected, which causes user-visible errors and forces requests to be aborted.
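A rough sketch of why grouped-query attention shrinks the cache: keys and values are stored for only a small number of KV heads, and each KV head is shared by a group of query heads. The 64/8 head counts mirror Llama 2 70B; the code itself is illustrative, not a real implementation:

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 16
group = n_q_heads // n_kv_heads          # 8 query heads share each KV head

# The cache holds only the 8 KV heads -> 8x smaller than full multi-head attention.
k_cache = np.random.randn(n_kv_heads, seq_len, head_dim)
v_cache = np.random.randn(n_kv_heads, seq_len, head_dim)

q = np.random.randn(n_q_heads, head_dim)     # one new token, all query heads
out = np.empty((n_q_heads, head_dim))
for h in range(n_q_heads):
    kv = h // group                                   # shared KV head for this query head
    scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)   # (seq_len,)
    w = np.exp(scores - scores.max()); w /= w.sum()
    out[h] = w @ v_cache[kv]

mha = n_q_heads  * seq_len * head_dim * 2 * 2    # K and V, FP16, one layer
gqa = n_kv_heads * seq_len * head_dim * 2 * 2
print(f"KV cache reduction: {mha // gqa}x")      # 8x
```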
💡 Key Takeaways
•KV cache converts attention computation from quadratic in total sequence length to linear per new token by storing previously computed keys and values
•Memory cost per token: 2 × layers × KV heads × head_dim × precision_bytes. In FP16, Llama 2 7B uses about 0.5 MB per token and BLOOM 176B roughly 4 MB per token
•On an 80 GB GPU, a 7B model with 14 GB weights leaves about 50 GB for KV cache, supporting roughly 100,000 token entries total across all sequences
•Grouped-query attention reduces KV memory by sharing key/value heads across multiple query heads, cutting cache size proportionally, as in Llama 2 70B; PaLM uses the related multi-query approach
•Primary failure mode is running out of memory during generation when sequences grow beyond estimates, causing request aborts and user-visible errors (a minimal admission-check sketch follows these takeaways)
•The fundamental tradeoff is memory consumption versus recomputation cost; without caching, generation is prohibitively expensive but with it, memory becomes the bottleneck
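One simple way to guard against that out-of-memory failure mode is to reserve worst-case cache space before admitting a request. The sketch below is an illustrative policy with made-up names, not any particular serving framework's scheduler:

```python
MB = 10**6

class KVCacheBudget:
    """Track free KV-cache bytes and admit requests only if their worst case fits."""

    def __init__(self, budget_bytes, bytes_per_token):
        self.free = budget_bytes
        self.per_token = bytes_per_token

    def try_admit(self, prompt_tokens, max_new_tokens):
        # Reserve the worst-case footprint up front; return the reservation or None.
        worst_case = (prompt_tokens + max_new_tokens) * self.per_token
        if worst_case > self.free:
            return None               # queue or reject instead of aborting mid-generation
        self.free -= worst_case
        return worst_case

    def release(self, reservation):
        # Give the bytes back when the request finishes or is cancelled.
        self.free += reservation

# Usage with the figures from the text: 50 GB budget, 0.5 MB per token.
budget = KVCacheBudget(budget_bytes=50_000 * MB, bytes_per_token=0.5 * MB)
reservation = budget.try_admit(prompt_tokens=800, max_new_tokens=200)
print("admitted" if reservation else "queued")
```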
📌 Examples
Llama 2 7B in FP16: Each token stores approximately 0.5 MB of KV data. A batch of 64 sequences at 800 tokens each consumes 64 × 800 × 0.5 MB = 25.6 GB of cache
BLOOM 176B: At roughly 4 MB per token, just 12,500 tokens would fill 50 GB of cache memory, severely limiting batch size and sequence length
Meta's Llama 2 70B uses grouped-query attention with 8 KV heads for 64 query heads, reducing KV cache size by 8x compared to multi-head attention
Typical GPU memory allocation: 14 GB for 7B model weights, 50 GB for KV cache (roughly 60% of the 80 GB card), with the remainder for activations and overhead