
How Do You Design an Inference Serving System with Dynamic Batching and KV Cache Management?

Inference serving for large language models requires balancing latency, throughput, and memory under highly variable request patterns. The core components are a dispatcher that batches requests intelligently, a worker pool that executes model forward passes, and a key-value (KV) cache manager that prevents memory fragmentation and exhaustion.

Start with the dispatcher. Requests arrive with varying prompt lengths and generation targets. A naive approach processes them one at a time, leaving the GPU idle between requests. Dynamic batching holds arrivals for a small window, typically 5 to 20 milliseconds, then packs similar-length prompts into one batch. Length bucketing groups requests into bins such as 512, 1024, 2048, and 4096 tokens to avoid head-of-line blocking, where a single 4096-token prompt delays eight 512-token prompts. This small queueing delay often improves throughput by 2 to 5 times; NVIDIA Triton reports gains of 2 to 3 times with single-digit millisecond added latency.

Continuous batching extends this by separating the prefill and decode phases. Prefill computes the initial KV cache for the entire prompt in one large matrix multiply, which benefits from large batch sizes. Decode generates one token per sequence per iteration, reusing cached keys and values. Continuous batching maintains a live decode schedule, adding new sequences as soon as capacity frees up and removing completed ones immediately. This keeps the GPU fully utilized without waiting for an entire batch to finish.

KV cache memory is the critical bottleneck. Each token stores keys and values across all layers, heads, and head dimensions. For a large model with 80 layers, 64 heads, and a head dimension of 128, one token consumes 2 to 3 MB in fp16. A batch of 512 sequences at 4,096 tokens each therefore needs several terabytes of KV memory (roughly 5 TB at 2.5 MB per token). Naive contiguous allocation causes fragmentation. Paged KV caches allocate memory in fixed-size pages, like operating system virtual memory, enabling efficient reuse and reducing fragmentation. OpenAI adopted paged attention and continuous batching to sustain high utilization at scale.

Implement admission control to prevent overload. Track available KV memory and queue depth. When memory headroom falls below 20 percent or queue delay exceeds p99 targets, reject new requests or route them to another pool. Autoscale based on sustained queue depth and utilization trends, not instantaneous spikes.

For a 70 billion parameter model on 8 A100 GPUs, expect to prefill an 8,000-token prompt in 1 to 2 seconds, then decode at 600 to 1,200 tokens per second aggregate depending on batch size and context length. Monitor tokens per second, average batch size, queue delay, streaming multiprocessor (SM) occupancy, and memory headroom. If SM occupancy drops below 70 percent, investigate kernel launch overhead or small batch sizes. If memory headroom is low, increase page eviction or reduce the maximum batch size.
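To make the dispatcher concrete, here is a minimal Python sketch of dynamic batching with length bucketing. It assumes a 10 millisecond window and a cap of 32 requests per batch, and it is not tied to Triton or any other serving framework; names such as `Request`, `DynamicBatcher`, and `submit_fn` are illustrative.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative constants; real values are tuned per deployment.
BUCKETS = [512, 1024, 2048, 4096]   # length bins for bucketing
MAX_BATCH = 32                      # max requests per batch
WINDOW_MS = 10                      # dynamic batching window (5-20 ms typical)

@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    arrival: float = field(default_factory=time.monotonic)

class DynamicBatcher:
    """Groups requests by length bucket and flushes on size or time."""

    def __init__(self, submit_fn):
        self.submit_fn = submit_fn          # callable that runs a batch on a worker
        self.queues = defaultdict(list)     # bucket size -> pending requests

    def _bucket(self, prompt_tokens: int) -> int:
        for b in BUCKETS:
            if prompt_tokens <= b:
                return b
        return BUCKETS[-1]                  # oversize prompts go to the largest bin

    def add(self, req: Request) -> None:
        self.queues[self._bucket(req.prompt_tokens)].append(req)

    def poll(self) -> None:
        """Call periodically (e.g., every millisecond) from the serving loop."""
        now = time.monotonic()
        for bucket, pending in self.queues.items():
            if not pending:
                continue
            oldest_wait_ms = (now - pending[0].arrival) * 1000.0
            # Flush when the batch is full or the oldest request has waited long enough.
            if len(pending) >= MAX_BATCH or oldest_wait_ms >= WINDOW_MS:
                batch, self.queues[bucket] = pending[:MAX_BATCH], pending[MAX_BATCH:]
                self.submit_fn(bucket, batch)
```

Keeping one queue per bucket is what prevents head-of-line blocking: a 4096-token prompt waits in its own bin instead of holding up a batch of 512-token prompts.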
💡 Key Takeaways
Dynamic batching groups requests by length bucket during a 5 to 20 millisecond window, improving throughput 2 to 5 times with single-digit millisecond queueing delay
Continuous batching separates prefill and decode, adding new sequences as capacity frees and removing completed ones immediately to maximize GPU occupancy without waiting for batch completion
KV cache memory scales as 2 (keys and values) × layers × heads × head dimension × tokens × bytes per element, reaching roughly 5 TB for 512 sequences at 4,096 tokens each in large models
Paged KV caches allocate memory in fixed pages like operating system virtual memory, reducing fragmentation and enabling efficient reuse across variable length sequences
Admission control rejects requests when KV memory headroom falls below 20 percent or queue delay exceeds p99 targets, preventing cascading failures during traffic spikes; a paged-pool and admission-control sketch follows this list
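The paged-cache and admission-control takeaways can be sketched together. The Python below is a hypothetical free-list allocator, not any framework's API; `PAGE_TOKENS`, `BYTES_PER_TOKEN`, and the 20 percent headroom floor are assumptions chosen to match the numbers in this section.

```python
# Minimal paged KV pool with an admission check; all constants are illustrative.
PAGE_TOKENS = 16                     # tokens stored per KV page
BYTES_PER_TOKEN = 2.5 * 1024**2      # ~2.5 MB/token (80 layers, 64 heads, 128 dim, fp16)
HEADROOM_FLOOR = 0.20                # reject new work below 20% free memory

class KVPagePool:
    """Free-list allocator over fixed-size KV pages, similar to OS virtual memory."""

    def __init__(self, total_bytes: int):
        self.page_bytes = int(PAGE_TOKENS * BYTES_PER_TOKEN)
        self.total_pages = total_bytes // self.page_bytes
        self.free_pages = list(range(self.total_pages))   # indices of unused pages
        self.seq_pages = {}                                # sequence id -> list of page indices

    def headroom(self) -> float:
        return len(self.free_pages) / self.total_pages

    def pages_needed(self, tokens: int) -> int:
        return -(-tokens // PAGE_TOKENS)                   # ceiling division

    def admit(self, seq_id: str, prompt_tokens: int, max_new_tokens: int) -> bool:
        """Admission control: reserve pages only if headroom stays above the floor."""
        need = self.pages_needed(prompt_tokens + max_new_tokens)
        remaining = len(self.free_pages) - need
        if remaining / self.total_pages < HEADROOM_FLOOR:
            return False                                   # caller should reject or reroute
        self.seq_pages[seq_id] = [self.free_pages.pop() for _ in range(need)]
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the free list for immediate reuse."""
        self.free_pages.extend(self.seq_pages.pop(seq_id, []))
```

A production allocator would grow a sequence's page list lazily as decode proceeds rather than reserving for `max_new_tokens` up front, but reserving up front keeps the admission check easy to read.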
📌 Examples
NVIDIA Triton dynamic batching increases throughput from 400 to 1200 tokens per second by batching 32 requests together with 8 millisecond queueing delay
OpenAI uses continuous batching with paged attention to sustain high utilization, adding sequences every 50 milliseconds as slots free up in the decode schedule
A 70 billion parameter model on 8 A100 GPUs prefills an 8 thousand token prompt in 1.5 seconds, then decodes 64 sequences at 1200 tokens per second aggregate
A paged KV cache with 16-token pages tracks 2 × 80 layers × 64 heads × 128 head dim × 2 bytes ≈ 2.5 MB per token (about 10 GB for a 4096-token sequence), using a page pool to avoid fragmentation and OOM errors; the arithmetic is worked out below
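The sizing in the last example can be reproduced with a few lines of arithmetic. The Python below assumes fp16 storage (2 bytes per element), full multi-head attention, and a 16-token page size; none of these values come from a specific deployment.

```python
# KV cache sizing, assuming fp16 and no grouped-query attention.
layers, heads, head_dim = 80, 64, 128
bytes_per_element = 2                      # fp16
kv_per_token = 2 * layers * heads * head_dim * bytes_per_element   # keys + values
print(kv_per_token / 2**20)                # ~2.5 MiB per token

seq_tokens = 4096
per_sequence = kv_per_token * seq_tokens
print(per_sequence / 2**30)                # ~10 GiB per 4096-token sequence

batch = 512
print(batch * per_sequence / 2**40)        # ~5 TiB for 512 such sequences

page_tokens = 16                           # assumed page size in tokens
print(-(-seq_tokens // page_tokens))       # 256 pages of ~40 MiB each per sequence
```

At these shapes the cache dwarfs the roughly 140 GB of fp16 weights for a 70 billion parameter model, which is why KV memory, not compute, usually caps the batch size.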