
How Do You Manage KV Cache Memory in Production?

Production LLM serving requires explicit memory management because the KV cache grows with every generated token and can easily exhaust GPU memory. The standard solution is to treat the KV cache as a pool of fixed-size blocks, typically 16 or 32 tokens per block, with a mapping from logical sequence positions to physical blocks. This paged layout enables on-demand growth, low fragmentation, and sharing across requests. Beam search becomes memory-efficient because beams share prefix blocks and allocate new blocks only for diverging continuations.

Memory planning starts with explicit partitioning. For a single-device deployment, carve out a fixed KV-cache budget after subtracting model weights: on an 80 GB GPU serving a 7B model with 14 GB of weights, reserving roughly 60 percent of the card (50 GB) for KV cache at 0.5 MB per token supports about 100,000 cached tokens. Layer admission control on top of this budget by tracking current KV occupancy and predicted output lengths, and be pessimistic when in doubt; an out-of-memory failure aborts user requests.

For very long contexts or high concurrency, apply KV compression. Quantizing the cache to FP8 or INT8 with per-head and per-layer scales reduces memory 2x to 4x with minimal quality loss. Eviction policies such as H2O or FastGen retain sink tokens from the beginning of the sequence, keep the most recent window, and evict middle tokens with the lowest accumulated attention scores; on many workloads this yields up to 80 percent KV reduction with small quality degradation. Sliding-window attention caps memory by attending only to the most recent W tokens, but it loses long-range context and must be supported by the model architecture.

Prefix caching optimizes repeated prompts by building a tree index that maps tokenized prefixes to KV block pointers. Common system prompts and few-shot examples are warmed into the cache, dramatically reducing time to first token (TTFT) for requests that share a prefix. Use least-recently-used (LRU) eviction for stale prefixes and enforce strict namespace isolation per tenant and per system-prompt variant. The critical failure mode is cache mixing across users, caused by incorrect tokenization boundaries or hidden personalization tokens, and it is a safety incident.

Monitor key metrics continuously: KV occupancy percentage, memory wasted to fragmentation, tokens per second, inter-token latency at p50 and p95, and prefix cache hit rate. Build automated guardrails that back off the prefill chunk size when inter-token latency rises, temporarily disable speculative decoding when acceptance rates drop, and shed load before out-of-memory conditions occur. Finally, distributed deployments using tensor or pipeline parallelism must shard KV pages consistently across devices and synchronize carefully: an imbalance causes one shard to run out of memory first, dropping the entire batch.
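A minimal sketch of the paged-block bookkeeping and pessimistic admission check described above. The class and method names (BlockPool, Sequence, can_admit) are illustrative rather than any particular framework's API; the 0.5 MB-per-token and 50 GB figures are the assumptions from the worked example.

```python
# Paged KV-cache bookkeeping with pessimistic admission control (illustrative sketch).
from dataclasses import dataclass, field

BLOCK_TOKENS = 16           # tokens per physical block
BYTES_PER_TOKEN = 0.5e6     # ~0.5 MB of KV per token for a 7B model in FP16
KV_BUDGET_BYTES = 50e9      # 50 GB reserved for KV cache

TOTAL_BLOCKS = int(KV_BUDGET_BYTES // (BLOCK_TOKENS * BYTES_PER_TOKEN))  # 6,250 blocks


@dataclass
class Sequence:
    seq_id: int
    prompt_len: int
    max_new_tokens: int
    block_table: list = field(default_factory=list)  # logical block index -> physical block id


class BlockPool:
    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))

    def blocks_needed(self, num_tokens: int) -> int:
        return -(-num_tokens // BLOCK_TOKENS)  # ceiling division

    def can_admit(self, seq: Sequence) -> bool:
        # Pessimistic admission: assume the request consumes its full output budget.
        worst_case_tokens = seq.prompt_len + seq.max_new_tokens
        return self.blocks_needed(worst_case_tokens) <= len(self.free)

    def grow(self, seq: Sequence, tokens_so_far: int) -> None:
        # Allocate physical blocks on demand as the sequence produces tokens.
        while len(seq.block_table) * BLOCK_TOKENS < tokens_so_far:
            seq.block_table.append(self.free.pop())

    def release(self, seq: Sequence) -> None:
        self.free.extend(seq.block_table)
        seq.block_table.clear()


pool = BlockPool(TOTAL_BLOCKS)
req = Sequence(seq_id=1, prompt_len=512, max_new_tokens=256)
if pool.can_admit(req):
    pool.grow(req, req.prompt_len)        # prefill: 32 blocks
    pool.grow(req, req.prompt_len + 64)   # after 64 decoded tokens: 36 blocks
print(len(req.block_table), "blocks in use,", len(pool.free), "free")
pool.release(req)
```

In a real engine the same block table also backs sharing for beam search: beams reference the shared prefix blocks and only diverging continuations receive fresh allocations.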
💡 Key Takeaways
Paged KV cache with fixed block sizes of 16 or 32 tokens enables on-demand growth, reduces fragmentation to under 4 percent, and allows prefix sharing across requests
Memory partition example: an 80 GB GPU with 14 GB of model weights reserves 50 GB (~60% of the card) for KV cache, supporting 100,000 tokens at 0.5 MB per token with admission control
KV quantization to FP8 or INT8 with per-head scales reduces memory 2x to 4x with minimal quality loss, while cache eviction policies like H2O achieve up to 80% reduction
Prefix caching with tree-indexed KV blocks dramatically reduces time to first token for shared system prompts, but requires strict namespace isolation to prevent cache-mixing security incidents (a minimal sketch follows this list)
Critical metrics to monitor: KV occupancy percentage, memory waste, tokens per second, inter-token latency at p50 and p95, and prefix cache hit rate
Distributed parallelism must shard KV pages consistently across devices; memory imbalance on one shard causes out-of-memory failures that drop entire batches
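The prefix-caching takeaway hinges on two details: keying lookups by tenant as well as by token prefix, and evicting stale prefixes with LRU. Below is a minimal sketch that uses a flat exact-match table instead of a full tree/radix index for brevity; all names (PrefixCache, lookup, insert) are hypothetical.

```python
# Prefix cache keyed by (tenant, token prefix) with LRU eviction (illustrative sketch).
# The tenant component of the key enforces namespace isolation and prevents cache mixing.
from collections import OrderedDict


class PrefixCache:
    def __init__(self, max_entries: int = 1024):
        self.entries = OrderedDict()  # (tenant_id, token tuple) -> list of KV block ids
        self.max_entries = max_entries

    def lookup(self, tenant_id: str, tokens: list[int]):
        key = (tenant_id, tuple(tokens))
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh LRU position on a hit
            return self.entries[key]
        return None

    def insert(self, tenant_id: str, tokens: list[int], block_ids: list[int]) -> None:
        key = (tenant_id, tuple(tokens))
        self.entries[key] = block_ids
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict the least recently used prefix


cache = PrefixCache()
system_prompt_tokens = list(range(200))        # stand-in for a 200-token system prompt
cache.insert("tenant-a", system_prompt_tokens, block_ids=[0, 1, 2])
assert cache.lookup("tenant-a", system_prompt_tokens) == [0, 1, 2]
assert cache.lookup("tenant-b", system_prompt_tokens) is None   # strict tenant isolation
```

A production index would match the longest shared prefix rather than requiring an exact prompt match, but the isolation and eviction logic stays the same.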
📌 Examples
Llama 2 7B at 0.5 MB per token: a 50 GB cache with 16-token blocks (16 tokens × 0.5 MB = 8 MB per block) yields 6,250 blocks, with a logical-to-physical mapping tracked per sequence
Prefix caching: a 200-token system prompt reused across 1,000 requests saves 200,000 token computations, reducing TTFT from 500 ms to 50 ms for those requests
FP8 quantization: a 50 GB cache compressed to 25 GB, doubling concurrency from 64 to 128 sequences while maintaining 99% of quality on dialog tasks
Cache eviction with H2O: a 2,000-token context reduced to 400 tokens by keeping 50 sink tokens, 300 recent tokens, and 50 high-attention middle tokens, maintaining coherence on 85% of queries (see the sketch below)
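The last example can be made concrete with a small keep/evict routine. This is a simplified sketch of an H2O-style policy rather than the published algorithm: the attention scores are synthetic, whereas a real system would accumulate them per cached token during decoding.

```python
# H2O-style keep/evict decision (simplified sketch): retain sink tokens, a recent
# window, and the middle tokens with the highest accumulated attention scores.
import random


def h2o_keep_indices(num_tokens, attn_scores, num_sink=50, recent_window=300, num_heavy=50):
    sinks = set(range(min(num_sink, num_tokens)))
    recent = set(range(max(0, num_tokens - recent_window), num_tokens))
    middle = [i for i in range(num_tokens) if i not in sinks and i not in recent]
    # "Heavy hitters": middle tokens carrying the most accumulated attention mass.
    heavy = set(sorted(middle, key=lambda i: attn_scores[i], reverse=True)[:num_heavy])
    return sorted(sinks | recent | heavy)


scores = [random.random() for _ in range(2000)]   # synthetic stand-in for accumulated attention
kept = h2o_keep_indices(2000, scores)
print(len(kept))   # 400 of 2,000 tokens retained, i.e. an 80% KV reduction
```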