What is KV Cache and Why Does It Dominate Memory in LLM Inference?
The key-value (KV) cache is an optimization that stores intermediate attention tensors to avoid redundant work during autoregressive text generation. When a large language model generates tokens one at a time, each new token must attend to all previous tokens in the sequence. Without caching, the model would recompute the attention keys and values for the entire history at every step, making per-token cost quadratic rather than linear in sequence length. The KV cache trades memory for speed by computing these tensors once and reusing them at every subsequent step.
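The mechanics are easy to see in a toy single-head decode loop. The sketch below (hypothetical shapes and random weights, no batching, positional encoding, or layer stack) projects the key and value for only the newest token and appends them to a growing cache instead of re-projecting the whole history at every step.

```python
import numpy as np

# Toy single-head attention decode loop (hypothetical shapes and random
# weights) showing what the KV cache stores and reuses.
d_model, head_dim = 64, 64
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_model, head_dim)) * 0.02
W_k = rng.standard_normal((d_model, head_dim)) * 0.02
W_v = rng.standard_normal((d_model, head_dim)) * 0.02

k_cache, v_cache = [], []   # grows by one (1, head_dim) row per generated token

def decode_step(x_t):
    """x_t: (1, d_model) hidden state of the newest token only."""
    # Project just the new token; cached keys/values of earlier tokens are reused.
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    q = x_t @ W_q
    K, V = np.vstack(k_cache), np.vstack(v_cache)   # (t, head_dim)
    scores = q @ K.T / np.sqrt(head_dim)            # (1, t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                              # (1, head_dim)

# Without the cache, every step would recompute x_1..x_t @ W_k and @ W_v,
# i.e. O(t) redundant projections per step and O(T^2) over a T-token generation.
for _ in range(5):
    out = decode_step(rng.standard_normal((1, d_model)))
```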
The memory cost is substantial and scales with both model size and context length. For Llama 2 7B at half precision, each token requires approximately 0.5 MB of KV cache. A single 2,000 token conversation consumes roughly 1 GB of cache memory, while an 8,000 token context uses about 4 GB. The 176 billion parameter BLOOM model needs approximately 4 MB per token, so a 4,000 token session alone requires 16 GB just for the cache, approaching the size of many GPU memory budgets.
The precise formula is 2 × batch_size × seq_length × num_layers × num_heads × head_dim × bytes_per_element, where num_heads is the number of key-value heads (equal to the number of query heads for standard multi-head attention). The factor of 2 accounts for the separate key and value tensors. For a 7B model with 32 layers, 32 heads, 128-dimensional heads, and FP16 precision, this works out to 2 × 1 × t × 32 × 32 × 128 × 2 bytes, or 524,288 bytes (roughly 0.5 MB) per token. When planning serving capacity, you must budget for model weights plus KV cache plus activations. A 7B model with 14 GB of weights can fit only 3 to 4 concurrent 2,000-token sessions on a 24 GB GPU before running out of memory.
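As a quick sanity check, here is a minimal calculator for that formula. The model shapes are the published configurations; the byte counts are estimates that ignore activations, allocator fragmentation, and framework overhead, which is why real-world capacity is lower than naive division would suggest.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Factor of 2 = separate key and value tensors; 2 bytes/elem = FP16."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 2 7B: 32 layers, 32 heads, head_dim 128, FP16.
print(kv_cache_bytes(1,    32, 32, 128))         # 524,288 B ≈ 0.5 MB per token
print(kv_cache_bytes(2000, 32, 32, 128) / 1e9)   # ≈ 1.0 GB for a 2,000-token session
print(kv_cache_bytes(8000, 32, 32, 128) / 1e9)   # ≈ 4.2 GB at 8,000 tokens

# BLOOM 176B: 70 layers, 112 heads, head_dim 128, FP16.
print(kv_cache_bytes(1,    70, 112, 128) / 1e6)  # ≈ 4.0 MB per token
print(kv_cache_bytes(4000, 70, 112, 128) / 1e9)  # ≈ 16 GB for a 4,000-token session
```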
Production systems at companies like Google and Meta have adopted architectural modifications specifically to shrink KV memory. Grouped-Query Attention (GQA), used in Llama 2 70B and Mistral 7B, reduces the number of key-value heads while keeping the query heads the same, cutting KV cache size proportionally. Multi-Query Attention (MQA) takes this further by using a single KV head. Sliding Window Attention (SWA) in Mistral 7B bounds attention to the most recent 4,096 tokens, capping cache growth for very long contexts.
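Plugging published head counts into the same helper gives a rough sense of how much each variant saves. This is an estimate only; the helper is repeated so the snippet runs on its own, and the MQA row reuses the 7B geometry purely for comparison.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    # Same formula as above, repeated so this snippet runs standalone.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

MB, GB = 1024 ** 2, 1024 ** 3

# Llama 2 7B: standard multi-head attention, 32 layers, 32 KV heads.
mha = kv_cache_bytes(1, 32, 32, 128)   # ≈ 0.50 MB per token
# Llama 2 70B: GQA with 8 KV heads shared by 64 query heads, 80 layers.
gqa = kv_cache_bytes(1, 80, 8, 128)    # ≈ 0.31 MB per token despite 10× the parameters
# MQA: a single KV head, shown on the 7B geometry for comparison.
mqa = kv_cache_bytes(1, 32, 1, 128)    # ≈ 0.016 MB per token

print(f"MHA {mha / MB:.2f} MB  GQA {gqa / MB:.2f} MB  MQA {mqa / MB:.3f} MB")

# Mistral 7B (32 layers, 8 KV heads) with a 4,096-token sliding window:
# the cache stops growing once the window is full, whatever the context length.
swa_cap = kv_cache_bytes(4096, 32, 8, 128)
print(f"Mistral 7B KV cap ≈ {swa_cap / GB:.2f} GB")   # ≈ 0.5 GB per sequence
```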
💡 Key Takeaways
• KV cache stores attention keys and values to make decoding linear time instead of quadratic, avoiding recomputation of the entire sequence history at each new token
• Memory cost formula is 2 × batch_size × seq_length × num_layers × num_heads × head_dim × bytes_per_element, with the factor of 2 for separate key and value tensors
• Llama 2 7B uses approximately 0.5 MB per token at FP16, so a 2,000-token session requires 1 GB and 8,000 tokens need 4 GB of KV memory alone
• BLOOM 176B requires about 4 MB per token, meaning a single 4,000-token conversation consumes 16 GB just for the cache, often exceeding available GPU memory
• Architectural optimizations like Grouped-Query Attention (GQA) in Llama 2 70B and Sliding Window Attention (SWA) in Mistral 7B reduce KV memory by sharing heads or bounding the context window
• Capacity planning must account for total memory as model weights plus KV cache plus activations; a 7B model with 14 GB of weights supports only 3 to 4 concurrent 2,000-token sessions on a 24 GB GPU
📌 Examples
Google production models use GQA to reduce KV cache size proportionally to the number of query groups, allowing higher concurrency without quality loss
Mistral 7B uses Sliding Window Attention limited to 4,096 tokens, capping KV memory growth and enabling bounded latency for very long conversations
Internal KV caching yields over 10× faster responses for multi-turn dialogs compared to recomputing the full context, and next-token latency drops by approximately 50% when the cache is available