
Dynamic Batching: Throughput vs Latency Tradeoffs in Request Scheduling

Dynamic batching aggregates individual inference requests into larger batches before execution, dramatically improving device utilization and throughput at the cost of queueing delay. Instead of processing one request at a time with the GPU sitting idle between arrivals, the scheduler waits for a configurable window (typically 1 to 50 milliseconds) to collect multiple requests, then executes them together. A GPU running ResNet50 might process a single image in 5 milliseconds but a batch of 32 images in only 20 milliseconds, an 8x improvement in throughput per GPU-hour. This efficiency translates directly into cost savings: serving the same queries per second (QPS) with batching can reduce the number of required GPU instances by 50% to 70%.

The tradeoff becomes critical under real-world traffic patterns. For steady, high-volume streams like video recommendation ranking at YouTube or feed ranking at Meta, batching works beautifully: requests arrive fast enough to form full batches within millisecond windows, and the added queueing delay is negligible compared to total processing time. For bursty or low-QPS services, however, dynamic batching can destroy tail latency. If your Service Level Objective (SLO) requires p95 latency under 100 milliseconds but your batch window is 20 milliseconds, you have already consumed 20% of your budget just waiting in the queue before any computation starts. Under spiky traffic, you will see low GPU utilization and high p95 and p99 latencies at the same time, a counterintuitive failure mode.

Production systems tune three knobs: maximum batch size (constrained by GPU memory), batch formation timeout (the window to wait), and per-model concurrency (the number of parallel instances). Google's TensorFlow Serving typically uses 4 to 16 millisecond windows for latency-sensitive services, while Triton configurations at NVIDIA for throughput-optimized workloads can use 50 millisecond windows with batch sizes of 64 to 128. The key insight is to separate queueing time from compute time in your metrics: if queue time dominates, reduce the window or increase concurrency rather than adding more GPUs.
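To make the mechanics concrete, here is a minimal sketch of a dynamic batcher in Python. The names (model_fn, DynamicBatcher) and knob values are illustrative assumptions, not any framework's actual API; the point is to show the batch formation window, the batch size cap, the concurrency knob, and the separation of queue time from compute time described above.

```python
# Minimal dynamic batcher sketch. model_fn, DynamicBatcher, and all knob
# values are illustrative assumptions, not any serving framework's API.
import asyncio
import time

MAX_BATCH_SIZE = 32        # knob 1: cap, bounded by GPU memory
MAX_QUEUE_DELAY_S = 0.004  # knob 2: 4 ms batch formation window
NUM_INSTANCES = 1          # knob 3: per-model concurrency (batch loops)


async def model_fn(inputs):
    """Stand-in for a batched forward pass on the accelerator."""
    await asyncio.sleep(0.020)  # pretend any batch takes ~20 ms
    return [f"pred:{x}" for x in inputs]


class DynamicBatcher:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def infer(self, x):
        """Client-facing call: enqueue one request and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, time.monotonic(), fut))
        return await fut

    async def batch_loop(self):
        """Form a batch up to MAX_BATCH_SIZE or MAX_QUEUE_DELAY_S, whichever comes first."""
        while True:
            batch = [await self.queue.get()]          # block for the first request
            deadline = time.monotonic() + MAX_QUEUE_DELAY_S
            while len(batch) < MAX_BATCH_SIZE:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break

            start = time.monotonic()
            outputs = await model_fn([x for x, _, _ in batch])
            end = time.monotonic()

            for (_, enqueued, fut), out in zip(batch, outputs):
                queue_ms = (start - enqueued) * 1e3    # time spent waiting in queue
                compute_ms = (end - start) * 1e3       # time spent in the forward pass
                print(f"batch={len(batch):3d} queue={queue_ms:6.1f}ms compute={compute_ms:6.1f}ms")
                fut.set_result(out)


async def main():
    batcher = DynamicBatcher()
    workers = [asyncio.create_task(batcher.batch_loop()) for _ in range(NUM_INSTANCES)]
    results = await asyncio.gather(*(batcher.infer(i) for i in range(100)))
    for w in workers:
        w.cancel()
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

Production servers implement this loop natively in C++ close to the device, but the knobs and the two timings worth monitoring separately are the same.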
💡 Key Takeaways
Batching can improve throughput by 3x to 8x: a single ResNet50 image takes 5 milliseconds versus 20 milliseconds for a batch of 32, cutting cost per inference by 50% to 70% through better GPU utilization
The batch formation window adds queueing delay before any computation starts: a 20 millisecond window alone consumes 20% of a 100 millisecond p95 latency budget, making wide windows unsuitable for strict SLOs
Failure mode under low or bursty QPS: simultaneously observing low GPU utilization (requests do not arrive fast enough to form batches) and high p95 or p99 latency (requests wait in queue)
Production window configurations: TensorFlow Serving uses 4 to 16 millisecond windows for latency-sensitive services, while Triton uses 50 millisecond windows for throughput-optimized batch workloads (see the configuration sketch after this list)
Separate queueing time from compute time in your metrics to diagnose bottlenecks: if queue time dominates, reduce the window or add per-model concurrency rather than scaling horizontally
Maximum batch size is constrained by GPU memory: batch size multiplied by per-sample activation footprint must fit in device memory, typically limiting batches to 16 to 128 samples depending on model size
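As a sketch of how these knobs are expressed in practice, the fragment below shows an illustrative Triton config.pbtxt; the model name, platform, and every value are example assumptions chosen for this card, not recommendations.

```
# Illustrative Triton config.pbtxt fragment (all values are example assumptions)
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 64                      # upper bound set by GPU memory
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]  # sizes the scheduler tries to form
  max_queue_delay_microseconds: 4000    # 4 ms batch formation window
}
instance_group [
  { count: 2, kind: KIND_GPU }          # per-model concurrency: two instances
]
```

TensorFlow Serving exposes the equivalent knobs (max_batch_size, batch_timeout_micros, num_batch_threads) through its batching parameters file.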
📌 Examples
YouTube recommendation ranking uses aggressive batching with 50 millisecond windows during steady traffic, forming batches of 128 requests and achieving 85% GPU utilization while keeping p95 latency under 200 milliseconds
Uber real-time Estimated Time of Arrival (ETA) prediction disabled dynamic batching entirely (max batch size 1) to guarantee sub-50 millisecond p99 latency for the rider app, accepting 3x higher GPU cost (see the arithmetic after this list)
A medical imaging inference service tuned Triton to a 10 millisecond window with a max batch size of 4 due to a 16 GB GPU memory constraint from 256×256×24 voxel volumes, achieving 60% utilization on steady clinical workflow traffic
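The cost multiples in these examples follow directly from the throughput arithmetic. Below is a short back-of-envelope in Python using the ResNet50 latencies quoted earlier on this card; the target QPS is an arbitrary assumption for illustration.

```python
import math

# Back-of-envelope for the cost tradeoff above, using the ResNet50 latencies
# from this card; target_qps is an arbitrary illustrative assumption.
single_latency_s = 0.005                   # one image per 5 ms
batch_size, batch_latency_s = 32, 0.020    # 32 images per 20 ms

throughput_unbatched = 1 / single_latency_s         # 200 images/s per GPU
throughput_batched = batch_size / batch_latency_s   # 1,600 images/s per GPU

target_qps = 10_000                                  # assumed service load
gpus_unbatched = math.ceil(target_qps / throughput_unbatched)   # 50 GPUs
gpus_batched = math.ceil(target_qps / throughput_batched)       # 7 GPUs

print(f"{gpus_unbatched} GPUs unbatched vs {gpus_batched} batched "
      f"(~{gpus_unbatched / gpus_batched:.1f}x cost)")
```

In practice the multiple lands below the ideal 8x because real traffic rarely fills every batch.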