
What is KV Cache in LLM Serving?

Definition
KV cache stores the key-value pairs computed during attention for previously generated tokens, eliminating redundant computation when generating each new token.

WHY KV CACHE MATTERS

LLMs generate text one token at a time. At each step, the attention mechanism attends to all previous tokens. Without caching, generating token N requires re-running the forward pass over tokens 1 through N-1, recomputing their keys and values from scratch. Summed over a sequence, this redundant recomputation grows as O(N²).

With KV cache, the keys and values for tokens 1 through N-1 are stored from previous steps. Generating token N only requires computing Q, K, V for the new token and attending with its query against the cached keys and values. Per-token projection work drops from O(N) to O(1), and total projection work over the sequence from O(N²) to O(N), making generation dramatically faster.
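To see the asymptotics concretely, count the K/V projection operations needed to generate N tokens. This tally is purely illustrative and not tied to any particular model:

```python
def projections_without_cache(n):
    # Each decode step re-runs the K/V projection for every token seen so far.
    return sum(i for i in range(1, n + 1))

def projections_with_cache(n):
    # Each token's K and V are projected exactly once, then reused from cache.
    return n

n = 1000
print(projections_without_cache(n))  # 500500 -- grows as O(N^2)
print(projections_with_cache(n))     # 1000   -- grows as O(N)
```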

HOW IT WORKS

During the first forward pass (prompt processing), compute K and V matrices for all prompt tokens and store them. For each subsequent token generated:

1. Compute Q, K, V for just the new token

2. Append new K, V to the cache

3. Compute attention using new Q against all cached K, V

4. Generate next token prediction
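The decode loop above can be sketched in a few lines. This is a toy single-head example in NumPy; the `qkv` function is a placeholder projection (a real model applies learned weight matrices to the hidden state), and none of the names here come from any serving framework's API:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query against all cached positions.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def qkv(x):
    # Placeholder projection for this sketch; a real model computes
    # Q, K, V from x using learned weight matrices.
    return x, x, x

rng = np.random.default_rng(0)
K_cache, V_cache = [], []
for _ in range(3):                       # decode 3 tokens
    x = rng.standard_normal(4)           # new token's hidden state
    q, k, v = qkv(x)                     # 1. Q, K, V for the new token only
    K_cache.append(k)                    # 2. append K, V to the cache
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))  # 3. attend to cache
    # 4. `out` would feed the rest of the layer and the next-token logits

print(len(K_cache))  # 3 -- the cache grows by one entry per generated token
```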

The cache grows linearly with sequence length. For a 70B parameter model with 8K context, KV cache can consume 16-32GB of GPU memory.

MEMORY IMPLICATIONS

KV cache size per token: 2 × num_layers × hidden_dim × bytes_per_value (the factor of 2 covers both keys and values). For Llama-70B (80 layers, 8192 hidden dim, FP16): ~2.6MB per token, so an 8K-context sequence requires ~20GB just for KV cache. (This assumes full multi-head attention; grouped-query attention shrinks the figure by sharing KV heads.)

This memory pressure is why context length and batch size trade off directly. More concurrent requests = less context per request, or more GPUs needed.
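The formula above is easy to turn into arithmetic. The 40GB budget below is a hypothetical figure for memory left over after model weights, used only to illustrate the context/batch trade-off:

```python
def kv_cache_bytes_per_token(num_layers, hidden_dim, bytes_per_value=2):
    # Factor of 2: both a key and a value vector are cached at every layer.
    return 2 * num_layers * hidden_dim * bytes_per_value

# Llama-70B-style dimensions, FP16 (2 bytes per value)
per_token = kv_cache_bytes_per_token(num_layers=80, hidden_dim=8192)
print(per_token / 1e6)               # 2.62144 -> ~2.6 MB per token
print(per_token * 8192 / 2**30)      # 20.0    -> ~20 GiB for an 8K context

budget = 40 * 2**30                  # hypothetical memory free for KV cache
print(budget // (per_token * 8192))  # 2 -- concurrent 8K-context requests fit
```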

💡 Key Insight: KV cache transforms generation from compute-bound to memory-bound. Optimizing LLM serving is largely about managing KV cache memory efficiently.
💡 Key Takeaways
- Without KV cache: O(N²) redundant K/V recomputation per sequence; with cache: O(N) total projection work, O(1) per new token
- Cache size per token: 2 × layers × hidden_dim × bytes; Llama-70B needs ~2.6MB/token, ~20GB for 8K context
- KV cache makes LLM serving memory-bound; memory management is the key optimization lever
📌 Interview Tips
1. Explain the O(N²) to O(N) speedup and why generation becomes memory-bound.
2. Calculate KV cache size for a specific model to show you understand the memory implications.