
Dynamic Batching for Low Latency GPU Inference

Dynamic batching is a serving pattern that groups incoming inference requests in real time to maximize GPU throughput while staying within latency budgets. A batcher sits in front of the GPU worker and holds requests for a small time window, typically 2 to 5 milliseconds, or until a maximum count such as 16 to 64 is reached. This brief wait allows multiple concurrent requests to arrive and form a batch, which is then submitted to the GPU as a single operation.

The key is balancing the wait time against your Service Level Agreement (SLA). A ranking service with a 100 millisecond p99 budget might allocate 70 milliseconds for model inference after reserving time for upstream feature fetching and downstream aggregation. Holding requests for 2 to 5 milliseconds adds minimal latency but can double or triple throughput by keeping the GPU busy with large batches instead of processing tiny batches inefficiently.

Meta and Google production systems use this approach extensively. At 5,000 to 10,000 queries per second (QPS), a 5 millisecond batching window naturally accumulates 25 to 50 requests. The GPU processes this batch in one kernel launch instead of 50 separate launches, cutting overhead dramatically. Device utilization jumps from 30% to 75%, and per-request compute cost drops by 2 to 5 times, all while keeping p99 latency under 100 milliseconds.

Under low traffic, such as overnight hours, batches do not fill completely. The time trigger ensures the batcher flushes whatever has accumulated once the wait period expires, preventing indefinite delays. Adaptive strategies can shrink the window during low traffic and expand it during peak hours to maintain both throughput and latency guarantees.
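A minimal sketch of the pattern, assuming a Python serving process where run_model is whatever callable executes the batched forward pass; the 5 millisecond window and batch cap of 32 are illustrative values from the ranges above, not a specific production configuration:

```python
import queue
import threading
import time

MAX_WAIT_S = 0.005   # flush after 5 ms even if the batch is not full (time trigger)
MAX_BATCH = 32       # flush immediately once 32 requests have accumulated (count trigger)


class DynamicBatcher:
    """Groups incoming requests until MAX_BATCH items arrive or MAX_WAIT_S elapses."""

    def __init__(self, run_model):
        self._run_model = run_model          # callable: list[payload] -> list[result]
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, payload):
        """Called per incoming request; blocks until the batched result is ready."""
        done = threading.Event()
        slot = {"payload": payload, "done": done, "result": None}
        self._requests.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            # Block until the first request arrives, then open the batching window.
            first = self._requests.get()
            batch = [first]
            deadline = time.monotonic() + MAX_WAIT_S

            # Keep pulling requests until the batch is full or the window expires,
            # whichever happens first.
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break

            # One GPU call for the whole batch, then fan results back out.
            results = self._run_model([slot["payload"] for slot in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()
```

In a real server, submit would be called from each request-handler thread; production systems typically replace the per-request Event with futures or an async event loop, but the two flush triggers stay the same.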
💡 Key Takeaways
Hold requests for 2 to 5 milliseconds or until reaching a count like 16 to 64, whichever happens first, then flush to GPU
At 5,000 to 10,000 QPS, a 5 millisecond window naturally accumulates 25 to 50 requests without explicit coordination
Device utilization increases from 30% to 75% while p99 latency stays under 100 milliseconds in Meta and Google production systems
Time based flushing prevents indefinite waiting during low traffic periods, ensuring every request eventually gets processed
Adaptive windows shrink during low load to reduce wasted wait time and expand during high load to maximize batch fill
Monitor batch fill ratio, queue time p99, and device utilization to tune the window size and maximum batch count
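As a rough illustration of the last two points, a hypothetical controller might adjust the window from the observed batch fill ratio; the thresholds and 1 millisecond step here are assumptions, and the 2 to 6 millisecond range mirrors the adaptive example below:

```python
# Hypothetical adaptive-window heuristic: widen the window when batches flush
# full (high traffic) and shrink it when they flush mostly empty (low traffic).
MIN_WAIT_S, MAX_WAIT_S = 0.002, 0.006   # 2-6 ms range
STEP_S = 0.001                          # adjust in 1 ms increments


def adjust_window(current_wait_s, recent_fill_ratios):
    """recent_fill_ratios: batch_size / max_batch for the last N flushed batches."""
    avg_fill = sum(recent_fill_ratios) / len(recent_fill_ratios)
    if avg_fill > 0.9:
        # Batches fill before the timer fires: traffic is high, so a longer
        # window (and possibly a larger batch cap) buys more throughput.
        step = STEP_S
    elif avg_fill < 0.5:
        # Batches flush half-empty: traffic is low, so stop paying wait time.
        step = -STEP_S
    else:
        step = 0.0
    return min(MAX_WAIT_S, max(MIN_WAIT_S, current_wait_s + step))
```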
📌 Examples
Netflix ranking service: 100ms p99 budget allocates 70ms for inference. Dynamic batcher holds requests for 3ms, forms batches of 32, achieves 2.5x throughput gain while staying at 85ms p99.
Uber ETA prediction: 5,000 QPS traffic with 4ms batching window creates batches of 20 to 25 requests. GPU processes batch in 15ms instead of 50 separate 5ms calls, reducing compute cost by 3x.
Google Search ranking: Adaptive batching increases window from 2ms to 6ms during peak traffic, forming batches of 48 to 64 instead of 16, doubling throughput without violating 100ms SLA.
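The batch sizes in these examples follow directly from arrival rate times window length; a quick back-of-the-envelope check using the figures above, with an assumed maximum batch size of 64:

```python
# Expected batch size is roughly arrival rate (QPS) x window length,
# capped by the configured maximum batch size.
def expected_batch_size(qps, window_s, max_batch=64):
    return min(max_batch, round(qps * window_s))

print(expected_batch_size(5_000, 0.004))    # ~20 requests (Uber-style example)
print(expected_batch_size(5_000, 0.005))    # ~25 requests, low end of 25-50 at 5 ms
print(expected_batch_size(10_000, 0.005))   # ~50 requests at 10,000 QPS
```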