What is KV Cache and Why Does It Dominate Memory in LLM Inference?
The Memory Cost
The memory cost is substantial and scales with both model size and context length. For Llama 2 7B at half precision, each token requires approximately 0.5 MB of KV cache. A single 2,000-token conversation consumes roughly 1 GB of cache memory, while an 8,000-token context uses about 4 GB. The 176B-parameter BLOOM model needs approximately 4 MB per token, so a 4,000-token session alone requires 16 GB just for the cache.
The Formula
The precise formula is: 2 x batch_size x seq_length x num_layers x num_kv_heads x head_dim x bytes_per_element. The factor of 2 accounts for the separate key and value tensors; num_kv_heads equals the number of query heads in standard multi-head attention. For a 7B model with 32 layers, 32 heads, 128-dimensional heads, and FP16 precision (2 bytes per element), this works out to 2 x 32 x 32 x 128 x 2 = 524,288 bytes, or 0.5 MB, per token.
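The formula is easy to sanity-check in a few lines of Python. This is a minimal sketch; the function name is illustrative, and the model configurations are the published ones for Llama 2 7B and BLOOM 176B:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Per-token KV cache size: 2 (K and V) x layers x KV heads x head_dim x dtype size.

    Multiply by batch_size x seq_length to get the full cache footprint.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama 2 7B: 32 layers, 32 heads, head_dim 128, FP16
llama_7b = kv_bytes_per_token(32, 32, 128)    # 524,288 bytes ~= 0.5 MB

# BLOOM 176B: 70 layers, 112 heads, head_dim 128, FP16
bloom_176b = kv_bytes_per_token(70, 112, 128)  # 4,014,080 bytes ~= 4 MB
```

Scaling these per-token figures by context length reproduces the totals above: 2,000 tokens of Llama 2 7B cache is about 1 GB, and 4,000 tokens of BLOOM cache is about 16 GB.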
Capacity Planning
When planning serving capacity, you must budget for model weights plus KV cache plus activations. A 7B model at FP16 occupies about 14 GB in weights, leaving roughly 10 GB on a 24 GB GPU. Each 2,000-token session consumes about 1 GB of cache, and activations, framework workspace, and fragmentation claim a further share, so in practice only 3 to 4 concurrent 2,000-token sessions fit before running out of memory.
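The budget above can be written out explicitly. This is a rough sketch, and the 6 GiB activation/overhead allowance is a hypothetical figure chosen for illustration, not a measured value:

```python
GIB = 1024 ** 3

gpu_memory = 24 * GIB               # e.g. an RTX 4090 / A10-class card
weights = 14 * GIB                  # 7B params x 2 bytes (FP16)
per_session_cache = 2000 * 524288   # 2,000 tokens x 0.5 MB/token ~= 1 GiB

# Hypothetical allowance for activations, framework workspace, fragmentation.
activation_budget = 6 * GIB

available = gpu_memory - weights - activation_budget
sessions = available // per_session_cache
print(sessions)  # 4 concurrent 2,000-token sessions under these assumptions
```

Shrinking the overhead allowance or the per-session context length raises the count, which is why the practical answer lands in a 3-to-4 range rather than at a single number.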
Architectural Optimizations
Production systems have adopted architectural modifications to reduce KV memory. Grouped Query Attention (GQA) in Llama 2 70B and Mistral 7B reduces the number of key-value heads while keeping query heads the same, cutting KV cache size proportionally. Multi Query Attention (MQA) takes this further by sharing a single KV head across all query heads. Sliding Window Attention (SWA) in Mistral 7B bounds attention to the most recent 4,096 tokens, capping cache growth for very long contexts.
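These savings follow directly from the per-token formula, since only KV heads (not query heads) contribute to the cache. A minimal sketch, using the published configurations of Llama 2 70B (80 layers, 64 query heads, 8 KV heads) and Mistral 7B (32 layers, 8 KV heads, 4,096-token window); the helper name is illustrative:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Query heads store no state, so only KV heads count toward cache size.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama 2 70B with GQA: 8 KV heads instead of the 64 that full MHA would use.
mha_cost = kv_bytes_per_token(80, 64, 128)  # hypothetical full-MHA cost
gqa_cost = kv_bytes_per_token(80, 8, 128)   # actual cost, 8x smaller

# Mistral 7B combines GQA with SWA: the 4,096-token window caps the
# per-sequence cache no matter how long the context grows.
mistral_per_token = kv_bytes_per_token(32, 8, 128)  # 131,072 bytes
mistral_cache_cap = 4096 * mistral_per_token        # 512 MiB upper bound
```

The two techniques compose: GQA shrinks the per-token cost, while SWA bounds the number of tokens that must be cached at all.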