
Dynamic Batching: Throughput vs Latency Tradeoffs in Request Scheduling

Dynamic batching aggregates individual inference requests into larger batches before execution, dramatically improving device utilization and throughput at the cost of queueing delay. Instead of processing one request at a time with the GPU sitting idle between arrivals, the scheduler waits for a configurable window (typically 1 to 50 milliseconds) to collect multiple requests, then executes them together. A GPU running ResNet50 might process a single image in 5 milliseconds but a batch of 32 images in only 20 milliseconds, an 8x improvement in throughput per GPU-hour. This efficiency translates directly into cost savings: serving the same queries per second (QPS) with batching can reduce the number of required GPU instances by 50% to 70%.

The tradeoff becomes critical under real-world traffic patterns. For steady, high-volume streams like video recommendation ranking at YouTube or feed ranking at Meta, batching works beautifully: requests arrive fast enough to form full batches within millisecond windows, and the added queueing delay is negligible compared to total processing time. For bursty or low-QPS services, however, dynamic batching can destroy tail latency. If your Service Level Objective (SLO) requires p95 latency under 100 milliseconds but your batch window is 20 milliseconds, you have already consumed 20% of your budget just waiting in the queue before any computation starts. Under spiky traffic, you will see low GPU utilization and high p95 and p99 latencies at the same time, a counterintuitive failure mode.

Production systems tune three knobs: maximum batch size (constrained by GPU memory), batch formation timeout (the window to wait), and per-model concurrency (the number of parallel instances). Google's TensorFlow Serving typically uses 4 to 16 millisecond windows for latency-sensitive services, while Triton configurations at NVIDIA for throughput-optimized workloads can use 50 millisecond windows with batch sizes of 64 to 128. The key insight is to separate queueing time from compute time in your metrics: if queue time dominates, reduce the window or increase concurrency rather than adding more GPUs.
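To make the mechanics concrete, here is a minimal sketch of a dynamic batcher in Python. The names (model_fn, DynamicBatcher) and knob values are illustrative assumptions, not any framework's actual API; the point is to show the batch formation window, the batch size cap, the concurrency knob, and the separation of queue time from compute time described above.

```python
# Minimal dynamic batcher sketch. model_fn, DynamicBatcher, and all knob
# values are illustrative assumptions, not any serving framework's API.
import asyncio
import time

MAX_BATCH_SIZE = 32        # knob 1: cap, bounded by GPU memory
MAX_QUEUE_DELAY_S = 0.004  # knob 2: 4 ms batch formation window
NUM_INSTANCES = 1          # knob 3: per-model concurrency (batch loops)


async def model_fn(inputs):
    """Stand-in for a batched forward pass on the accelerator."""
    await asyncio.sleep(0.020)  # pretend any batch takes ~20 ms
    return [f"pred:{x}" for x in inputs]


class DynamicBatcher:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def infer(self, x):
        """Client-facing call: enqueue one request and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, time.monotonic(), fut))
        return await fut

    async def batch_loop(self):
        """Form a batch up to MAX_BATCH_SIZE or MAX_QUEUE_DELAY_S, whichever comes first."""
        while True:
            batch = [await self.queue.get()]          # block for the first request
            deadline = time.monotonic() + MAX_QUEUE_DELAY_S
            while len(batch) < MAX_BATCH_SIZE:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break

            start = time.monotonic()
            outputs = await model_fn([x for x, _, _ in batch])
            end = time.monotonic()

            for (_, enqueued, fut), out in zip(batch, outputs):
                queue_ms = (start - enqueued) * 1e3    # time spent waiting in queue
                compute_ms = (end - start) * 1e3       # time spent in the forward pass
                print(f"batch={len(batch):3d} queue={queue_ms:6.1f}ms compute={compute_ms:6.1f}ms")
                fut.set_result(out)


async def main():
    batcher = DynamicBatcher()
    workers = [asyncio.create_task(batcher.batch_loop()) for _ in range(NUM_INSTANCES)]
    results = await asyncio.gather(*(batcher.infer(i) for i in range(100)))
    for w in workers:
        w.cancel()
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

Production servers implement this loop natively in C++ close to the device, but the knobs and the two timings worth monitoring separately are the same.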
💡 Key Takeaways
Batching can improve throughput by 3x to 8x: a single ResNet50 image takes 5 milliseconds versus 20 milliseconds for a batch of 32, cutting cost per inference by 50% to 70% through better GPU utilization
The batch formation window adds queueing delay before any computation starts: a 20 millisecond window alone consumes 20% of a 100 millisecond p95 latency budget, making wide windows unsuitable for strict SLOs
Failure mode under low or bursty QPS: simultaneously observing low GPU utilization (requests do not arrive fast enough to form batches) and high p95 or p99 latency (requests wait in queue)
Production window configurations: TensorFlow Serving uses 4 to 16 millisecond windows for latency-sensitive services, while Triton uses 50 millisecond windows for throughput-optimized batch workloads (see the configuration sketch after this list)
Separate queueing time from compute time in your metrics to diagnose bottlenecks: if queue time dominates, reduce the window or add per-model concurrency rather than scaling horizontally
Maximum batch size is constrained by GPU memory: batch size multiplied by per-sample activation footprint must fit in device memory, typically limiting batches to 16 to 128 samples depending on model size
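As a sketch of how these knobs are expressed in practice, the fragment below shows an illustrative Triton config.pbtxt; the model name, platform, and every value are example assumptions chosen for this card, not recommendations.

```
# Illustrative Triton config.pbtxt fragment (all values are example assumptions)
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 64                      # upper bound set by GPU memory
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]  # sizes the scheduler tries to form
  max_queue_delay_microseconds: 4000    # 4 ms batch formation window
}
instance_group [
  { count: 2, kind: KIND_GPU }          # per-model concurrency: two instances
]
```

TensorFlow Serving exposes the equivalent knobs (max_batch_size, batch_timeout_micros, num_batch_threads) through its batching parameters file.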
📌 Examples
YouTube recommendation ranking uses aggressive batching with 50 millisecond windows during steady traffic, forming batches of 128 requests and achieving 85% GPU utilization while keeping p95 latency under 200 milliseconds
Uber real-time Estimated Time of Arrival (ETA) prediction disabled dynamic batching entirely (max batch size 1) to guarantee sub-50 millisecond p99 latency for the rider app, accepting 3x higher GPU cost (see the arithmetic after this list)
A medical imaging inference service tuned Triton to a 10 millisecond window with a max batch size of 4 due to a 16 GB GPU memory constraint from 256×256×24 voxel volumes, achieving 60% utilization on steady clinical workflow traffic
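The cost multiples in these examples follow directly from the throughput arithmetic. Below is a short back-of-envelope in Python using the ResNet50 latencies quoted earlier on this card; the target QPS is an arbitrary assumption for illustration.

```python
import math

# Back-of-envelope for the cost tradeoff above, using the ResNet50 latencies
# from this card; target_qps is an arbitrary illustrative assumption.
single_latency_s = 0.005                   # one image per 5 ms
batch_size, batch_latency_s = 32, 0.020    # 32 images per 20 ms

throughput_unbatched = 1 / single_latency_s         # 200 images/s per GPU
throughput_batched = batch_size / batch_latency_s   # 1,600 images/s per GPU

target_qps = 10_000                                  # assumed service load
gpus_unbatched = math.ceil(target_qps / throughput_unbatched)   # 50 GPUs
gpus_batched = math.ceil(target_qps / throughput_batched)       # 7 GPUs

print(f"{gpus_unbatched} GPUs unbatched vs {gpus_batched} batched "
      f"(~{gpus_unbatched / gpus_batched:.1f}x cost)")
```

In practice the multiple lands below the ideal 8x because real traffic rarely fills every batch.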