
How Do You Manage KV Cache Memory in Production?

MEMORY PRESSURE

KV cache is the primary memory consumer in LLM serving. For a 70B model serving 100 concurrent requests with 4K context each, KV cache alone requires ~100GB. This often exceeds available GPU memory.

Memory management determines throughput. More concurrent requests = higher throughput, but each request needs KV cache memory. The scheduler must balance request count against available memory.
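The ~100GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes Llama-2-70B-like dimensions (80 layers, 8 KV heads via grouped-query attention, head dim 128, fp16); these parameters are illustrative, and the exact total depends on the model config.

```python
# Back-of-envelope KV cache sizing. All model dimensions here are assumed
# Llama-2-70B-like values, not read from any specific config file.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    # 2x for the separate K and V tensors stored per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=100)
print(f"{total / 2**30:.0f} GiB")  # → 125 GiB, the same ballpark as ~100GB
```

Note that without grouped-query attention (64 KV heads instead of 8) the total would be 8x larger, which is why GQA matters so much for serving.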

PAGED ATTENTION

Traditional KV cache allocates contiguous memory for max sequence length, wasting memory for shorter sequences. Paged attention allocates memory in fixed-size blocks, like virtual memory in operating systems.

Benefits: no memory wasted on the unused tail of shorter sequences. Reduced fragmentation. Memory sharing across requests with common prefixes (e.g., a shared system prompt). vLLM implements paged attention and reports 2-4x better memory efficiency.
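The bookkeeping behind paged attention can be sketched as a block allocator plus a per-sequence block table. This is a minimal illustration of the idea, not vLLM's actual API; the reference counting shows how common prefixes can occupy memory once.

```python
# Minimal sketch of paged KV cache bookkeeping: each sequence maps to a list
# of fixed-size blocks (its "block table"), and blocks are reference-counted
# so shared prefixes are stored once. Illustrative names, not vLLM internals.

BLOCK_TOKENS = 16  # tokens per block (vLLM's default block size is 16)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Copy-on-write style prefix sharing: another sequence reuses the block.
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

def blocks_needed(num_tokens):
    # Ceiling division: at most BLOCK_TOKENS - 1 slots are wasted per sequence,
    # versus reserving the full max length in the contiguous scheme.
    return -(-num_tokens // BLOCK_TOKENS)

allocator = BlockAllocator(num_blocks=1024)
block_table = [allocator.alloc() for _ in range(blocks_needed(100))]
print(len(block_table))  # → 7 blocks cover a 100-token sequence
```

Contrast with the contiguous scheme: a 100-token sequence under a 4K max length would reserve 4096 token slots up front, while the paged scheme reserves 112.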

MEMORY OFFLOADING

When GPU memory is exhausted, offload KV cache to CPU memory. This is slower but allows serving more concurrent requests.

Trade-off: CPU-to-GPU transfer adds latency (~1-5ms per token, depending on cache size). This is acceptable for throughput-focused batch workloads, but not for latency-critical interactive use.

Tiered approach: keep hot requests on GPU, cold requests (waiting, low priority) on CPU. Move requests between tiers based on activity.
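The tiered approach can be sketched as simple placement bookkeeping. This is pure-Python illustration under assumed names; a real system moves the actual tensors between device and host memory (e.g., via async copies) when a request changes tier.

```python
# Sketch of tiered KV cache placement: active ("hot") requests live on GPU,
# idle ("cold") requests are offloaded to CPU, and a request is promoted back
# before it resumes decoding. Dict bookkeeping stands in for real tensor moves.

class TieredKVCache:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity  # max requests resident on GPU
        self.gpu = {}                     # request_id -> cache (hot tier)
        self.cpu = {}                     # request_id -> cache (cold tier)

    def admit(self, request_id, cache):
        if len(self.gpu) < self.gpu_capacity:
            self.gpu[request_id] = cache
        else:
            self.cpu[request_id] = cache  # no GPU room: spill to cold tier

    def demote_idle(self, request_id):
        # Offload a waiting / low-priority request to free GPU memory.
        self.cpu[request_id] = self.gpu.pop(request_id)

    def promote(self, request_id):
        # Bring a cold request back before decoding resumes; this is where the
        # ~1-5ms CPU-to-GPU transfer cost from the text is paid.
        if request_id in self.cpu and len(self.gpu) < self.gpu_capacity:
            self.gpu[request_id] = self.cpu.pop(request_id)
            return True
        return False
```

In practice the demotion policy (which requests count as "cold") is driven by the scheduler: waiting on user input, low priority, or longest time since last token.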

PREEMPTION STRATEGIES

When memory is tight, preempt lower-priority requests to make room for higher-priority ones.

Options: swap the KV cache to CPU (preserves progress; resume later), or drop it entirely (restart generation from scratch, recomputing the prefill). The choice depends on request priority and how much completed work would be lost.
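That decision rule can be expressed as a small policy function. The heuristic and threshold below are illustrative assumptions, not taken from any particular scheduler: swapping pays off when a lot of generation would otherwise be recomputed, while dropping is cheaper when a request has barely started.

```python
# Sketch of the swap-vs-drop preemption decision. The threshold is an assumed
# tunable, not a value from any real serving framework.

def preemption_action(tokens_generated, swap_threshold_tokens=256,
                      low_priority=False):
    if low_priority and tokens_generated < swap_threshold_tokens:
        return "drop"  # little progress to preserve; cheaper to restart later
    return "swap"      # move KV cache to CPU and resume without recompute

print(preemption_action(50, low_priority=True))    # → drop
print(preemption_action(2000, low_priority=True))  # → swap
```

A production policy would also weigh CPU swap-space availability and PCIe bandwidth against the GPU cost of re-running the prefill.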

⚠️ Key Trade-off: Memory management is the core challenge. Paged attention for efficiency, offloading for capacity, preemption for priority; combine strategies based on your latency vs throughput requirements.
💡 Key Takeaways
KV cache dominates memory; 70B model with 100 requests × 4K context = ~100GB cache alone
Paged attention: allocate in blocks like virtual memory; 2-4x better memory efficiency via vLLM
Offloading to CPU: increases capacity but adds latency; use tiered approach (hot on GPU, cold on CPU)
📌 Interview Tips
1. Explain the paged attention analogy: KV cache blocks work like OS virtual memory pages.
2. Describe preemption strategies: swap to CPU (preserves progress) vs drop (restarts generation).