How Do You Manage KV Cache Memory in Production?
MEMORY PRESSURE
KV cache is the primary memory consumer in LLM serving. For a 70B model serving 100 concurrent requests with 4K context each, KV cache alone requires ~100GB. This often exceeds available GPU memory.
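The arithmetic behind that figure can be sketched as follows. Per token, the cache stores a K and a V vector per layer; the concrete numbers below (80 layers, 8 KV heads of dimension 128 under grouped-query attention, FP16) are illustrative assumptions for a 70B-class model, not values from the text — check your model's config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   tokens_per_request, num_requests, bytes_per_elem=2):
    """Total KV cache size: 2 tensors (K and V) per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens_per_request * num_requests

# Assumed 70B-class config: 80 layers, 8 KV heads x 128 dims (GQA), FP16.
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       tokens_per_request=4096, num_requests=100)
print(f"{total / 1e9:.0f} GB")  # prints "134 GB"
```

This lands in the same ballpark as the ~100GB figure above; the exact number swings widely with the attention variant (full multi-head attention would be several times larger) and the KV dtype.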
Memory management directly determines throughput: more concurrent requests mean higher throughput, but each request consumes KV cache memory. The scheduler must balance batch size against available memory.
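One minimal sketch of that balancing act: admit requests from the waiting queue only while their estimated KV footprint fits in the free pool. The function names and the block-based accounting here are assumptions for illustration, not a specific serving framework's API.

```python
from collections import deque

def schedule_step(waiting, running, free_blocks, blocks_needed):
    """One scheduling step: admit waiting requests in FIFO order while
    KV blocks remain. `blocks_needed(req)` is a (hypothetical) estimate
    of a request's KV cache footprint in blocks."""
    while waiting and blocks_needed(waiting[0]) <= free_blocks:
        req = waiting.popleft()
        free_blocks -= blocks_needed(req)
        running.append(req)
    return free_blocks

# Usage: requests are (id, estimated_blocks) pairs.
waiting = deque([("a", 3), ("b", 5), ("c", 2)])
running = []
free = schedule_step(waiting, running, free_blocks=6,
                     blocks_needed=lambda r: r[1])
# "a" fits (3 <= 6); "b" does not (5 > 3), so admission stops there.
```

Real schedulers are more sophisticated (continuous batching, priorities, growth headroom for decoding), but the core constraint is this memory check.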
PAGED ATTENTION
Traditional KV cache allocates contiguous memory for max sequence length, wasting memory for shorter sequences. Paged attention allocates memory in fixed-size blocks, like virtual memory in operating systems.
Benefits: no memory wasted on shorter sequences, far less fragmentation, and memory sharing across requests with common prefixes. vLLM implements paged attention and reports 2-4x better memory efficiency.
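The core data structure can be sketched as a free list of fixed-size physical blocks plus a per-request block table. This is a simplified sketch in the spirit of vLLM's design, not its actual implementation; the 16-token block size mirrors vLLM's default, and the class and method names are made up for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Sketch of paged KV allocation: a free list of fixed-size blocks
    and a per-request block table (logical block -> physical block)."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> [physical block ids]
        self.lengths = {}       # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve cache space for one more token; a new physical block
        is grabbed only when the current block fills up."""
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; preempt or offload")
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

# Usage: 17 tokens occupy 2 blocks; a contiguous max-length allocation
# for, say, a 4096-token limit would have reserved 256 blocks up front.
alloc = BlockAllocator(num_blocks=4)
for _ in range(17):
    alloc.append_token("r1")
```

Prefix sharing falls out naturally: two requests with the same prompt can point their block tables at the same physical blocks (with copy-on-write when they diverge).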
MEMORY OFFLOADING
When GPU memory is exhausted, offload KV cache to CPU memory. This is slower but allows serving more concurrent requests.
Trade-off: CPU-to-GPU transfer adds latency (~1-5ms per token depending on cache size). Acceptable for throughput-focused batch workloads. Not suitable for latency-critical interactive use.
Tiered approach: keep hot requests on GPU, cold requests (waiting, low priority) on CPU. Move requests between tiers based on activity.
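A minimal sketch of that tiering policy, assuming activity-based eviction with an idle-time threshold. The `swap_out`/`swap_in` callables stand in for real device copies (e.g. CUDA host/device transfers); the class, its names, and the threshold value are all hypothetical.

```python
import time

class TieredKVCache:
    """Sketch of GPU/CPU KV tiering: active requests stay on GPU; requests
    idle past a threshold are swapped to CPU and swapped back on their
    next activity. Swap callables are placeholders for device copies."""
    def __init__(self, swap_out, swap_in, idle_threshold_s=5.0):
        self.swap_out, self.swap_in = swap_out, swap_in
        self.idle_threshold_s = idle_threshold_s
        self.last_active = {}  # request id -> last activity timestamp
        self.on_gpu = set()

    def touch(self, req_id):
        """Record activity; restore the KV cache to GPU if it was offloaded."""
        if req_id in self.last_active and req_id not in self.on_gpu:
            self.swap_in(req_id)  # cold request coming back: CPU -> GPU copy
        self.on_gpu.add(req_id)
        self.last_active[req_id] = time.monotonic()

    def evict_cold(self):
        """Demote requests idle past the threshold to CPU memory."""
        now = time.monotonic()
        for req_id in list(self.on_gpu):
            if now - self.last_active[req_id] > self.idle_threshold_s:
                self.swap_out(req_id)  # GPU -> CPU copy
                self.on_gpu.discard(req_id)
```

A production system would trigger `evict_cold` under memory pressure rather than on a timer alone, and would account for the transfer latency noted above when deciding what to demote.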
PREEMPTION STRATEGIES
When memory is tight, preempt lower-priority requests to make room for higher-priority ones.
Options: Swap KV cache to CPU (preserve progress, resume later). Drop KV cache entirely and recompute it from the prompt and already-generated tokens when the request resumes. The choice depends on request priority and expected remaining length.
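One way to frame the swap-vs-recompute choice is to compare estimated restore times: swapping pays a round-trip transfer cost proportional to cache size, while recomputation re-runs prefill, which is compute-bound but highly parallel. This is a simplified cost model with made-up parameter names; the rates would have to be measured for a specific deployment.

```python
def choose_preemption(tokens_generated, swap_tokens_per_s,
                      recompute_tokens_per_s):
    """Sketch: pick whichever path restores a preempted request's
    progress faster. `swap_tokens_per_s` is effective PCIe/CPU-copy
    throughput; `recompute_tokens_per_s` is prefill throughput.
    Both rates are assumed measured constants (hypothetical)."""
    swap_cost = 2 * tokens_generated / swap_tokens_per_s  # out + back in
    recompute_cost = tokens_generated / recompute_tokens_per_s
    return "swap" if swap_cost < recompute_cost else "recompute"

# Long context + fast interconnect: swapping wins.
choose_preemption(1000, swap_tokens_per_s=10_000, recompute_tokens_per_s=2_000)
# Short context + fast prefill: recomputation wins (and frees CPU memory).
choose_preemption(100, swap_tokens_per_s=1_000, recompute_tokens_per_s=5_000)
```

Real systems also weigh CPU memory availability and request priority, as noted above, not just restore latency.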