What is KV Cache and Why Does It Dominate Memory in LLM Inference?
The key-value (KV) cache is an optimization that stores intermediate attention tensors to avoid redundant work during autoregressive text generation. When a large language model generates tokens one at a time, each new token must attend to all previous tokens in the sequence. Without caching, the model would recompute the attention keys and values for the entire history at every step, making per-token cost quadratic rather than linear in sequence length. The KV cache trades memory for speed by computing these tensors once and reusing them at every subsequent step.
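The mechanics are easy to see in a toy single-head decode loop. The sketch below (hypothetical shapes and random weights, no batching, positional encoding, or layer stack) projects the key and value for only the newest token and appends them to a growing cache instead of re-projecting the whole history at every step.

```python
import numpy as np

# Toy single-head attention decode loop (hypothetical shapes and random
# weights) showing what the KV cache stores and reuses.
d_model, head_dim = 64, 64
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_model, head_dim)) * 0.02
W_k = rng.standard_normal((d_model, head_dim)) * 0.02
W_v = rng.standard_normal((d_model, head_dim)) * 0.02

k_cache, v_cache = [], []   # grows by one (1, head_dim) row per generated token

def decode_step(x_t):
    """x_t: (1, d_model) hidden state of the newest token only."""
    # Project just the new token; cached keys/values of earlier tokens are reused.
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    q = x_t @ W_q
    K, V = np.vstack(k_cache), np.vstack(v_cache)   # (t, head_dim)
    scores = q @ K.T / np.sqrt(head_dim)            # (1, t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                              # (1, head_dim)

# Without the cache, every step would recompute x_1..x_t @ W_k and @ W_v,
# i.e. O(t) redundant projections per step and O(T^2) over a T-token generation.
for _ in range(5):
    out = decode_step(rng.standard_normal((1, d_model)))
```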
The memory cost is substantial and scales with both model size and context length. For Llama 2 7B at half precision, each token requires approximately 0.5 MB of KV cache. A single 2,000 token conversation consumes roughly 1 GB of cache memory, while an 8,000 token context uses about 4 GB. The 176 billion parameter BLOOM model needs approximately 4 MB per token, so a 4,000 token session alone requires 16 GB just for the cache, approaching the size of many GPU memory budgets.
The precise formula is 2 × batch_size × seq_length × num_layers × num_heads × head_dim × bytes_per_element, where num_heads is the number of key-value heads (equal to the number of query heads for standard multi-head attention). The factor of 2 accounts for the separate key and value tensors. For a 7B model with 32 layers, 32 heads, 128-dimensional heads, and FP16 precision, this works out to 2 × 1 × t × 32 × 32 × 128 × 2 bytes, or 524,288 bytes (roughly 0.5 MB) per token. When planning serving capacity, you must budget for model weights plus KV cache plus activations. A 7B model with 14 GB of weights can fit only 3 to 4 concurrent 2,000-token sessions on a 24 GB GPU before running out of memory.
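As a quick sanity check, here is a minimal calculator for that formula. The model shapes are the published configurations; the byte counts are estimates that ignore activations, allocator fragmentation, and framework overhead, which is why real-world capacity is lower than naive division would suggest.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Factor of 2 = separate key and value tensors; 2 bytes/elem = FP16."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 2 7B: 32 layers, 32 heads, head_dim 128, FP16.
print(kv_cache_bytes(1,    32, 32, 128))         # 524,288 B ≈ 0.5 MB per token
print(kv_cache_bytes(2000, 32, 32, 128) / 1e9)   # ≈ 1.0 GB for a 2,000-token session
print(kv_cache_bytes(8000, 32, 32, 128) / 1e9)   # ≈ 4.2 GB at 8,000 tokens

# BLOOM 176B: 70 layers, 112 heads, head_dim 128, FP16.
print(kv_cache_bytes(1,    70, 112, 128) / 1e6)  # ≈ 4.0 MB per token
print(kv_cache_bytes(4000, 70, 112, 128) / 1e9)  # ≈ 16 GB for a 4,000-token session
```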
Production systems at companies like Google and Meta have adopted architectural modifications specifically to shrink KV memory. Grouped-Query Attention (GQA), used in Llama 2 70B and Mistral 7B, reduces the number of key-value heads while keeping the query heads the same, cutting KV cache size proportionally. Multi-Query Attention (MQA) takes this further by using a single KV head. Sliding Window Attention (SWA) in Mistral 7B bounds attention to the most recent 4,096 tokens, capping cache growth for very long contexts.
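Plugging published head counts into the same helper gives a rough sense of how much each variant saves. This is an estimate only; the helper is repeated so the snippet runs on its own, and the MQA row reuses the 7B geometry purely for comparison.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    # Same formula as above, repeated so this snippet runs standalone.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

MB, GB = 1024 ** 2, 1024 ** 3

# Llama 2 7B: standard multi-head attention, 32 layers, 32 KV heads.
mha = kv_cache_bytes(1, 32, 32, 128)   # ≈ 0.50 MB per token
# Llama 2 70B: GQA with 8 KV heads shared by 64 query heads, 80 layers.
gqa = kv_cache_bytes(1, 80, 8, 128)    # ≈ 0.31 MB per token despite 10× the parameters
# MQA: a single KV head, shown on the 7B geometry for comparison.
mqa = kv_cache_bytes(1, 32, 1, 128)    # ≈ 0.016 MB per token

print(f"MHA {mha / MB:.2f} MB  GQA {gqa / MB:.2f} MB  MQA {mqa / MB:.3f} MB")

# Mistral 7B (32 layers, 8 KV heads) with a 4,096-token sliding window:
# the cache stops growing once the window is full, whatever the context length.
swa_cap = kv_cache_bytes(4096, 32, 8, 128)
print(f"Mistral 7B KV cap ≈ {swa_cap / GB:.2f} GB")   # ≈ 0.5 GB per sequence
```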
💡 Key Takeaways
• KV cache stores attention keys and values to make decoding linear time instead of quadratic, avoiding recomputation of the entire sequence history at each new token
• Memory cost formula is 2 × batch_size × seq_length × num_layers × num_heads × head_dim × bytes_per_element, with the factor of 2 for separate key and value tensors
• Llama 2 7B uses approximately 0.5 MB per token at FP16, so a 2,000-token session requires 1 GB and 8,000 tokens need 4 GB of KV memory alone
• BLOOM 176B requires about 4 MB per token, meaning a single 4,000-token conversation consumes 16 GB just for the cache, often exceeding available GPU memory
• Architectural optimizations like Grouped-Query Attention (GQA) in Llama 2 70B and Sliding Window Attention (SWA) in Mistral 7B reduce KV memory by sharing heads or bounding the context window
• Capacity planning must account for total memory as model weights plus KV cache plus activations; a 7B model with 14 GB of weights supports only 3 to 4 concurrent 2,000-token sessions on a 24 GB GPU
📌 Examples
Google production models use GQA to reduce KV cache size proportionally to the number of query groups, allowing higher concurrency without quality loss
Mistral 7B uses Sliding Window Attention limited to 4,096 tokens, capping KV memory growth and enabling bounded latency for very long conversations
Internal KV caching yields over 10× faster responses for multi-turn dialogs compared to recomputing the full context, and next-token latency drops by approximately 50% when the cache is available