Model Serving & Inference • Latency Optimization (Batching, Caching, Quantization) • Medium • ⏱️ ~3 min
How Does Dynamic Batching Balance Throughput and Latency?
Dynamic batching processes multiple inference requests together to amortize fixed overhead and improve hardware utilization. For Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), running many operations in parallel dramatically increases arithmetic intensity, the ratio of computation to memory transfer. This pushes the workload from being memory bandwidth bound toward compute bound, where the accelerator's floating point units are fully utilized. For large language models, batching can improve throughput from tens to hundreds of tokens per second on the same hardware.
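As a rough sketch of why this works, the toy cost model below treats each decode step as a fixed weight streaming cost plus a small per request compute cost; the numbers are illustrative assumptions, not measurements of any particular GPU or model:

```python
# Rough cost model of one decode step (all numbers are illustrative
# assumptions, not measurements of any particular accelerator or model).
FIXED_MS = 30.0        # per step: streaming weights from HBM, kernel launch overhead
PER_REQUEST_MS = 0.5   # incremental compute for each extra request in the batch

def tokens_per_second(batch_size: int) -> float:
    # One decode step emits one token per request; batching amortizes FIXED_MS.
    step_ms = FIXED_MS + PER_REQUEST_MS * batch_size
    return batch_size / (step_ms / 1000.0)

for b in (1, 8, 32):
    print(f"batch={b:>2}: ~{tokens_per_second(b):.0f} tokens/s")
# batch= 1: ~33 tokens/s, batch= 8: ~235 tokens/s, batch=32: ~696 tokens/s
```

Under these assumptions, throughput moves from tens to hundreds of tokens per second purely by amortizing the fixed per step cost across more requests.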
Production systems use micro batching windows, typically 10 to 50 milliseconds, to collect arriving requests before starting inference. This window size balances two competing forces: longer windows gather more requests and improve throughput, but they also add queuing delay that increases user perceived latency. The key metric is p95 or p99 latency, the time by which 95% or 99% of requests complete, rather than just average latency. A 20 ms batching window might improve throughput by 3× while adding at most 20 ms of queuing delay to p50 latency, keeping interactive experiences snappy.
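A minimal window collector might look like the following sketch, assuming an asyncio based server; `run_batch`, the 20 ms window, and the 32 request cap are illustrative placeholders rather than a real serving API:

```python
import asyncio
import time

WINDOW_MS = 20    # batching window: bounds the queuing delay added per request
MAX_BATCH = 32    # cap driven by the memory budget

async def micro_batcher(queue: asyncio.Queue, run_batch) -> None:
    """Collect requests for up to WINDOW_MS (or until MAX_BATCH), then run them together."""
    while True:
        batch = [await queue.get()]                 # block until at least one request arrives
        deadline = time.monotonic() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                               # window expired; run what we have
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                      # one forward pass for the whole batch
```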
Continuous batching solves a critical problem with naive static batching: head of line blocking. In traditional batching, all requests in a batch must finish before the next batch starts. If one request generates 1,000 tokens while others need only 50, the short requests wait unnecessarily. Continuous batching instead schedules at the granularity of individual decoding steps: at each step, finished requests leave the batch immediately and newly arrived requests join in their place. This keeps stragglers from holding up the entire batch and dramatically improves tail latency for mixed workloads. Systems at Meta and Amazon use heterogeneous continuous batching to serve both short queries and long document generation on the same infrastructure.
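The loop below is a toy sketch of this idea, assuming a hypothetical `decode_step` callback that advances every active sequence by one token; it is not the API of any particular serving system:

```python
from collections import deque
from dataclasses import dataclass

# Toy continuous-batching loop: each iteration advances every active sequence
# by one token, finished sequences exit immediately, and waiting requests join
# between iterations. Names here are illustrative, not a real serving API.

@dataclass
class Seq:
    req_id: int
    tokens_remaining: int        # tokens this request still has to generate

def continuous_batching(waiting: deque, max_active: int, decode_step) -> None:
    active: list[Seq] = []
    while waiting or active:
        # Admit new arrivals up to the cap; no need to wait for the batch to drain.
        while waiting and len(active) < max_active:
            active.append(waiting.popleft())
        decode_step(active)      # one forward pass: one new token per active sequence
        for seq in active:
            seq.tokens_remaining -= 1
        # Short requests finish and leave now; they never wait on a 1,000-token neighbor.
        active = [s for s in active if s.tokens_remaining > 0]
```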
The memory trade off is significant. Batch size directly multiplies KV cache and activation memory requirements: a batch of 8 requests with 2,000 token contexts uses 8× the KV memory of a single request. Larger batches also increase padding waste when sequence lengths vary widely: a batch with lengths [100, 100, 2000] pads the short sequences to 2,000, so 95% of the compute spent on each short sequence goes to padding tokens. Production systems cap maximum batch size based on memory budget and use admission control to reject or queue requests during traffic bursts rather than triggering out of memory errors.
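The sizing math is easy to sketch; the model shape below is an illustrative 7B class configuration in fp16, not the published numbers of a specific model:

```python
# KV cache sizing sketch. The shape below is an illustrative 7B-class
# configuration in fp16, not a specific model's published numbers.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 32, 128, 2

def kv_cache_bytes(batch_size: int, context_len: int) -> int:
    # K and V tensors per layer, each [kv_heads, head_dim] per token.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 512 KiB here
    return batch_size * context_len * per_token

GIB = 1024 ** 3
print(f"1 request  x 2,000 tokens: {kv_cache_bytes(1, 2000) / GIB:.2f} GiB")
print(f"8 requests x 2,000 tokens: {kv_cache_bytes(8, 2000) / GIB:.2f} GiB")  # 8x the single request
# Padding waste for lengths [100, 100, 2000] padded to 2,000: each short
# sequence spends 1,900 of its 2,000 slots (95%) on padding, which is
# roughly 63% of the whole batch's token slots.
```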
💡 Key Takeaways
• Batching improves tokens per second throughput by 3× to 10× by increasing arithmetic intensity and utilizing Graphics Processing Unit (GPU) parallel compute, amortizing fixed overhead across requests
• Micro batching windows of 10 to 50 milliseconds balance throughput gains against queuing delay, targeting p95 latency service level objectives (SLOs) rather than just average latency
• Continuous batching interleaves decoding across requests to avoid head of line blocking, where one long sequence delays shorter requests in traditional static batches
• Memory cost scales linearly with batch size: a batch of 8 with 2k tokens uses 8× the KV cache and activation memory, requiring careful capacity planning to avoid out of memory (OOM) errors
• Padding inefficiency occurs with heterogeneous sequence lengths; a batch with lengths [100, 100, 2000] wastes 95% of compute on padding for the short sequences
• Admission control and backpressure are critical during traffic bursts; systems must queue or reject requests rather than accept batches that exceed memory budget and crash (a minimal admission check is sketched after this list)
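The last point can be made concrete with a small admission check; the 80 GiB card, the 80% ceiling, and the per token KV cost are illustrative assumptions, and `Admitter` is a hypothetical helper rather than part of any real serving framework:

```python
# Admission control sketch: refuse work that would push the projected KV cache
# footprint past the memory budget, so bursts turn into queuing instead of OOM.
MEMORY_BUDGET_BYTES = int(80 * 0.8 * 1024**3)   # keep ~20% headroom on an 80 GiB card (illustrative)
KV_BYTES_PER_TOKEN = 512 * 1024                 # carried over from the sizing sketch above

class Admitter:
    """Tracks reserved KV cache bytes; callers queue or shed load on a False return."""

    def __init__(self) -> None:
        self.reserved_bytes = 0

    def try_admit(self, context_len: int) -> bool:
        needed = context_len * KV_BYTES_PER_TOKEN
        if self.reserved_bytes + needed > MEMORY_BUDGET_BYTES:
            return False                        # backpressure: queue or reject, never OOM
        self.reserved_bytes += needed
        return True

    def release(self, context_len: int) -> None:
        self.reserved_bytes -= context_len * KV_BYTES_PER_TOKEN
```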
📌 Examples
Meta and Amazon deploy heterogeneous continuous batching to serve mixed workloads of short queries (50 tokens) and long document generation (2000+ tokens) on shared infrastructure without tail latency degradation
A 20 ms batching window can add only 20 ms to p50 latency while improving throughput by 3×, keeping interactive chat experiences responsive
Production systems cap batch size at 16 or 32 requests based on available GPU memory, monitoring KV cache plus activations to stay below 80% memory utilization and leave headroom for bursty traffic