Dynamic Batching: Throughput vs Latency Tradeoffs in Request Scheduling
The Batching Philosophy
Dynamic batching aggregates individual inference requests into larger batches before execution, dramatically improving device utilization and throughput at the cost of queueing delay. Instead of processing one request at a time with the GPU sitting idle between arrivals, the scheduler waits for a configurable window (typically 1 to 50 milliseconds) to collect multiple requests, then executes them together. A GPU running ResNet50 might process a single image in 5 milliseconds but can process a batch of 32 images in only 20 milliseconds, achieving 8x better throughput per GPU hour. This efficiency directly translates to cost savings: serving the same QPS with batching can reduce required GPU instances by 50% to 70%.
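The mechanism described above can be sketched in a few lines: a queue collects incoming requests, and the scheduler blocks until the first request arrives, then keeps pulling until either the window expires or the batch is full. This is a minimal illustration, not any particular framework's implementation; the class and parameter names (`DynamicBatcher`, `max_batch`, `window_s`) are invented for this sketch.

```python
import time
from queue import Queue, Empty

class DynamicBatcher:
    """Collect requests for up to `window_s` seconds, or until
    `max_batch` requests have arrived, then hand them off as one batch.
    Illustrative sketch; names do not come from any specific framework."""

    def __init__(self, max_batch=32, window_s=0.02):
        self.max_batch = max_batch
        self.window_s = window_s
        self.queue = Queue()

    def submit(self, request):
        self.queue.put(request)

    def next_batch(self):
        # Block until the first request arrives; the batch window
        # starts from that moment, not from an empty-queue idle state.
        batch = [self.queue.get()]
        deadline = time.monotonic() + self.window_s
        while len(batch) < self.max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self.queue.get(timeout=remaining))
            except Empty:
                break
        return batch
```

Note the design choice: the timeout clock starts when the first request arrives, so an otherwise idle service does not pay the window cost while no work is pending, only once a request is actually waiting.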
When Batching Breaks Down
The tradeoff becomes critical under real-world traffic patterns. For steady, high-volume streams like video recommendation ranking at YouTube or feed ranking at Meta, batching works beautifully: requests arrive fast enough to form full batches within millisecond-scale windows, and the added queueing delay is negligible compared to total processing time. For bursty or low-QPS services, however, dynamic batching can destroy tail latency: if your SLO requires p95 latency under 100 milliseconds and your batch window is 20 milliseconds, you have already spent 20% of your budget waiting in the queue before any computation starts.
The Counterintuitive Failure Mode
Under spiky traffic, you will see low GPU utilization and high p95 and p99 latencies simultaneously. This happens because requests arrive too slowly to fill batches: each one sits alone waiting for the batch window to expire, burning latency budget before computation even starts.
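A toy back-of-the-envelope model makes the failure mode concrete. Assume one request arrives every 200 milliseconds against a 20 millisecond batch window and 5 milliseconds of compute (illustrative numbers, consistent with the figures used elsewhere in this section): every request pays the full window alone, yet the device sits idle almost the entire time.

```python
# Toy model of the failure mode: arrivals far sparser than the batch
# window mean each request waits out the whole window by itself, so
# latency is high while utilization is low. All numbers illustrative.

WINDOW_MS = 20        # batch formation timeout
COMPUTE_MS = 5        # compute cost for the resulting batch of one
ARRIVAL_GAP_MS = 200  # one request every 200 ms, far slower than the window

def simulate(n_requests):
    latencies = []
    busy_ms = 0.0
    for _ in range(n_requests):
        # Each request arrives alone, waits the full window, then computes.
        latencies.append(WINDOW_MS + COMPUTE_MS)
        busy_ms += COMPUTE_MS
    total_ms = n_requests * ARRIVAL_GAP_MS
    utilization = busy_ms / total_ms
    return latencies, utilization

latencies, util = simulate(100)
# Every request pays 25 ms (20 ms queueing + 5 ms compute),
# yet the device is busy only 2.5% of the time.
```

Under these assumptions the p50, p95, and p99 are identical: every single request eats the window, which is why the tail metrics look uniformly bad rather than showing a few outliers.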
Production Tuning
Production systems tune three knobs: maximum batch size (constrained by GPU memory), batch formation timeout (how long to wait for the batch to fill), and per-model concurrency (the number of parallel model instances). TensorFlow Serving deployments typically use 4 to 16 millisecond windows for latency-sensitive services, while Triton configurations for throughput-optimized workloads can use 50 millisecond windows with batch sizes of 64 to 128. The key insight is to separate queueing time from compute time in your metrics: if queue time dominates, shrink the window or increase concurrency rather than adding more GPUs.
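The queue-versus-compute separation can be sketched by recording three timestamps per request: enqueue, compute start, and compute finish. The class and method names below (`BatchMetrics`, `queue_dominates`) are invented for illustration; real serving stacks expose equivalent signals as exported metrics rather than an in-process helper like this.

```python
class BatchMetrics:
    """Track queue time and compute time separately so the two can be
    compared. Illustrative sketch; production systems would export
    these as histograms to a metrics backend instead."""

    def __init__(self):
        self.queue_ms = []
        self.compute_ms = []

    def observe(self, enqueued_at, started_at, finished_at):
        # Timestamps in seconds (e.g. from time.monotonic()).
        self.queue_ms.append((started_at - enqueued_at) * 1000)
        self.compute_ms.append((finished_at - started_at) * 1000)

    def queue_dominates(self):
        # Compare median queue time against median compute time. If the
        # queue wins, shrink the window or add model instances before
        # reaching for more GPUs.
        q = sorted(self.queue_ms)[len(self.queue_ms) // 2]
        c = sorted(self.compute_ms)[len(self.compute_ms) // 2]
        return q > c
```

Comparing medians rather than means keeps a few slow outlier batches from masking the common case, though in practice you would look at the full histograms, especially p95 and p99 of each component.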