Model Serving & Inference • Latency Optimization (Batching, Caching, Quantization) • Medium • ⏱️ ~3 min
How Does Dynamic Batching Balance Throughput and Latency?
Dynamic batching processes multiple inference requests together to amortize fixed overhead and improve hardware utilization. For Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), running many operations in parallel dramatically increases arithmetic intensity, the ratio of computation to memory transfer. This pushes the workload from being memory bandwidth bound toward compute bound, where the accelerator's floating point units are fully utilized. For large language models, batching can improve throughput from tens to hundreds of tokens per second on the same hardware.
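As a rough sketch of why this works, the toy cost model below treats each decode step as a fixed weight streaming cost plus a small per request compute cost; the numbers are illustrative assumptions, not measurements of any particular GPU or model:

```python
# Rough cost model of one decode step (all numbers are illustrative
# assumptions, not measurements of any particular accelerator or model).
FIXED_MS = 30.0        # per step: streaming weights from HBM, kernel launch overhead
PER_REQUEST_MS = 0.5   # incremental compute for each extra request in the batch

def tokens_per_second(batch_size: int) -> float:
    # One decode step emits one token per request; batching amortizes FIXED_MS.
    step_ms = FIXED_MS + PER_REQUEST_MS * batch_size
    return batch_size / (step_ms / 1000.0)

for b in (1, 8, 32):
    print(f"batch={b:>2}: ~{tokens_per_second(b):.0f} tokens/s")
# batch= 1: ~33 tokens/s, batch= 8: ~235 tokens/s, batch=32: ~696 tokens/s
```

Under these assumptions, throughput moves from tens to hundreds of tokens per second purely by amortizing the fixed per step cost across more requests.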
Production systems use micro batching windows, typically 10 to 50 milliseconds, to collect arriving requests before starting inference. This window size balances two competing forces: longer windows gather more requests and improve throughput, but they also add queuing delay that increases user perceived latency. The key metric is p95 or p99 latency, the time by which 95% or 99% of requests complete, rather than just average latency. A 20 ms batching window might improve throughput by 3× while adding at most 20 ms of queuing delay to p50 latency, keeping interactive experiences snappy.
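A minimal window collector might look like the following sketch, assuming an asyncio based server; `run_batch`, the 20 ms window, and the 32 request cap are illustrative placeholders rather than a real serving API:

```python
import asyncio
import time

WINDOW_MS = 20    # batching window: bounds the queuing delay added per request
MAX_BATCH = 32    # cap driven by the memory budget

async def micro_batcher(queue: asyncio.Queue, run_batch) -> None:
    """Collect requests for up to WINDOW_MS (or until MAX_BATCH), then run them together."""
    while True:
        batch = [await queue.get()]                 # block until at least one request arrives
        deadline = time.monotonic() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                               # window expired; run what we have
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                      # one forward pass for the whole batch
```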
Continuous batching solves a critical problem with naive static batching: head of line blocking. In traditional batching, all requests in a batch must finish before the next batch starts. If one request generates 1,000 tokens while others need only 50, the short requests wait unnecessarily. Continuous batching instead schedules at the granularity of individual decoding steps: at each step, finished requests leave the batch immediately and newly arrived requests join in their place. This keeps stragglers from holding up the entire batch and dramatically improves tail latency for mixed workloads. Systems at Meta and Amazon use heterogeneous continuous batching to serve both short queries and long document generation on the same infrastructure.
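The loop below is a toy sketch of this idea, assuming a hypothetical `decode_step` callback that advances every active sequence by one token; it is not the API of any particular serving system:

```python
from collections import deque
from dataclasses import dataclass

# Toy continuous-batching loop: each iteration advances every active sequence
# by one token, finished sequences exit immediately, and waiting requests join
# between iterations. Names here are illustrative, not a real serving API.

@dataclass
class Seq:
    req_id: int
    tokens_remaining: int        # tokens this request still has to generate

def continuous_batching(waiting: deque, max_active: int, decode_step) -> None:
    active: list[Seq] = []
    while waiting or active:
        # Admit new arrivals up to the cap; no need to wait for the batch to drain.
        while waiting and len(active) < max_active:
            active.append(waiting.popleft())
        decode_step(active)      # one forward pass: one new token per active sequence
        for seq in active:
            seq.tokens_remaining -= 1
        # Short requests finish and leave now; they never wait on a 1,000-token neighbor.
        active = [s for s in active if s.tokens_remaining > 0]
```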
The memory trade off is significant. Batch size directly multiplies KV cache and activation memory requirements: a batch of 8 requests with 2,000 token contexts uses 8× the KV memory of a single request. Larger batches also increase padding waste when sequence lengths vary widely: a batch with lengths [100, 100, 2000] pads the short sequences to 2,000, so 95% of the compute spent on each short sequence goes to padding tokens. Production systems cap maximum batch size based on memory budget and use admission control to reject or queue requests during traffic bursts rather than triggering out of memory errors.
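The sizing math is easy to sketch; the model shape below is an illustrative 7B class configuration in fp16, not the published numbers of a specific model:

```python
# KV cache sizing sketch. The shape below is an illustrative 7B-class
# configuration in fp16, not a specific model's published numbers.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 32, 128, 2

def kv_cache_bytes(batch_size: int, context_len: int) -> int:
    # K and V tensors per layer, each [kv_heads, head_dim] per token.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 512 KiB here
    return batch_size * context_len * per_token

GIB = 1024 ** 3
print(f"1 request  x 2,000 tokens: {kv_cache_bytes(1, 2000) / GIB:.2f} GiB")
print(f"8 requests x 2,000 tokens: {kv_cache_bytes(8, 2000) / GIB:.2f} GiB")  # 8x the single request
# Padding waste for lengths [100, 100, 2000] padded to 2,000: each short
# sequence spends 1,900 of its 2,000 slots (95%) on padding, which is
# roughly 63% of the whole batch's token slots.
```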
💡 Key Takeaways
• Batching improves tokens per second throughput by 3× to 10× by increasing arithmetic intensity and utilizing Graphics Processing Unit (GPU) parallel compute, amortizing fixed overhead across requests
• Micro batching windows of 10 to 50 milliseconds balance throughput gains against queuing delay, targeting p95 latency service level objectives (SLOs) rather than just average latency
• Continuous batching interleaves decoding across requests to avoid head of line blocking, where one long sequence delays shorter requests in traditional static batches
• Memory cost scales linearly with batch size: a batch of 8 with 2k tokens uses 8× the KV cache and activation memory, requiring careful capacity planning to avoid out of memory (OOM) errors
• Padding inefficiency occurs with heterogeneous sequence lengths; a batch with lengths [100, 100, 2000] wastes 95% of compute on padding for the short sequences
• Admission control and backpressure are critical during traffic bursts; systems must queue or reject requests rather than accept batches that exceed memory budget and crash (a minimal admission check is sketched after this list)
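The last point can be made concrete with a small admission check; the 80 GiB card, the 80% ceiling, and the per token KV cost are illustrative assumptions, and `Admitter` is a hypothetical helper rather than part of any real serving framework:

```python
# Admission control sketch: refuse work that would push the projected KV cache
# footprint past the memory budget, so bursts turn into queuing instead of OOM.
MEMORY_BUDGET_BYTES = int(80 * 0.8 * 1024**3)   # keep ~20% headroom on an 80 GiB card (illustrative)
KV_BYTES_PER_TOKEN = 512 * 1024                 # carried over from the sizing sketch above

class Admitter:
    """Tracks reserved KV cache bytes; callers queue or shed load on a False return."""

    def __init__(self) -> None:
        self.reserved_bytes = 0

    def try_admit(self, context_len: int) -> bool:
        needed = context_len * KV_BYTES_PER_TOKEN
        if self.reserved_bytes + needed > MEMORY_BUDGET_BYTES:
            return False                        # backpressure: queue or reject, never OOM
        self.reserved_bytes += needed
        return True

    def release(self, context_len: int) -> None:
        self.reserved_bytes -= context_len * KV_BYTES_PER_TOKEN
```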
📌 Examples
Meta and Amazon deploy heterogeneous continuous batching to serve mixed workloads of short queries (50 tokens) and long document generation (2000+ tokens) on shared infrastructure without tail latency degradation
A 20 ms batching window can add only 20 ms to p50 latency while improving throughput by 3×, keeping interactive chat experiences responsive
Production systems cap batch size at 16 or 32 requests based on available GPU memory, monitoring KV cache plus activations to stay below 80% memory utilization and leave headroom for bursty traffic