Model Serving & Inference • Multi-model ServingMedium⏱️ ~2 min
Dynamic Batching in Multi-Model Serving
Dynamic batching groups multiple inference requests into a single forward pass to maximize GPU utilization, commonly yielding 1.5 to 5x throughput improvement at the cost of added queuing latency. In multi-model serving, batching happens per model: requests for the same model wait in a queue until a batch size threshold or timeout is reached, then execute together. This is critical because GPUs achieve peak efficiency when processing many inputs simultaneously, but naive one-at-a-time inference leaves hardware idle between requests.
The core mechanism is a per-model queue with two triggers: batch size (for example, accumulate 8 requests) and max batch delay (for example, wait no more than 10 milliseconds). Whichever trigger fires first causes the batch to execute. For interactive APIs with strict latency budgets, the timeout is the critical parameter: TorchServe users commonly set 5 to 15 millisecond max delays, accepting a small added latency to gain 2 to 3x throughput for computer vision models on GPU. For offline batch scoring, you can raise the timeout to 100+ milliseconds to build larger batches and maximize throughput.
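To make the two triggers concrete, here is a minimal Python sketch of one model's batching loop. It is illustrative rather than TorchServe's actual code: the class and parameter names (DynamicBatcher, run_batch, max_batch_size, max_batch_delay_s) are assumptions for this example, and returning results to individual callers is omitted for brevity.

```python
import queue
import threading
import time


class DynamicBatcher:
    """One model's batching loop: flush when either the size threshold
    or the max-delay timeout fires, whichever comes first."""

    def __init__(self, run_batch, max_batch_size=8, max_batch_delay_s=0.010):
        self.run_batch = run_batch            # callable: list of requests -> one forward pass
        self.max_batch_size = max_batch_size
        self.max_batch_delay_s = max_batch_delay_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, request):
        # Called by the serving frontend for each incoming request.
        self.requests.put(request)

    def _loop(self):
        while True:
            batch = [self.requests.get()]     # block until the first request arrives
            deadline = time.monotonic() + self.max_batch_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                     # timeout trigger fired
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break                     # timeout trigger fired while waiting
            self.run_batch(batch)             # size or timeout hit: execute the whole batch
```

In TorchServe the corresponding per-model knobs are typically batch_size and max_batch_delay, supplied when the model is registered or in the server configuration.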
The main failure mode is head-of-line blocking: if one slow or large model dominates a shared batch queue, it delays fast models waiting behind it. Multi-model serving makes this worse because heterogeneous models with vastly different inference times (a 5ms image classifier versus a 200ms natural language processing encoder) can share workers. The symptom is p95 latency inflation for the fast models. The fix is per-model queues with isolated batching, so each model accumulates its batches independently, or partitioning workers by model size class (small/fast models on one pool, large/slow models on another).
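To show the isolation pattern, here is a sketch that builds on the DynamicBatcher above. It is again illustrative: PerModelRouter and the registered model names and settings are hypothetical. Each model gets its own queue, size limit, and timeout, so a slow model only ever delays itself.

```python
class PerModelRouter:
    """Keeps one independent DynamicBatcher per model so a slow model's
    queue can never delay a fast model's batches (no head-of-line blocking)."""

    def __init__(self):
        self.batchers = {}

    def register(self, model_name, run_batch, max_batch_size, max_batch_delay_s):
        # Each model gets its own queue, size limit, and timeout.
        self.batchers[model_name] = DynamicBatcher(run_batch, max_batch_size, max_batch_delay_s)

    def submit(self, model_name, request):
        self.batchers[model_name].submit(request)


# Isolate a fast classifier from a slow encoder (illustrative settings).
router = PerModelRouter()
router.register("image_classifier", run_batch=print, max_batch_size=16, max_batch_delay_s=0.005)
router.register("text_encoder", run_batch=print, max_batch_size=8, max_batch_delay_s=0.050)
router.submit("image_classifier", {"image": "..."})  # joins only the classifier's queue
```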
Another edge case is batch-size variance. If traffic is spiky, some batches are full (hitting the size limit) while others are tiny singletons (hitting the timeout). Full batches get high throughput, but singletons waste GPU cycles. Production systems mitigate this with adaptive batch sizing (monitor queue depth and adjust batch parameters dynamically) or with continuous batching, where the system starts inference as soon as one request arrives and opportunistically merges in new requests that arrive during execution.
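As one illustration of the adaptive idea, the sketch below picks batch parameters from queue depth. The thresholds and the adapt_batch_params name are assumptions rather than a standard policy, and a production system would smooth the queue-depth signal over a window instead of reacting to a single reading.

```python
def adapt_batch_params(queue_depth, base_size=8, base_delay_s=0.010,
                       max_size=32, min_delay_s=0.002):
    """Illustrative policy: deep queues mean batches fill fast, so go wider;
    near-empty queues mean waiting only creates singletons, so flush sooner."""
    if queue_depth >= 2 * base_size:
        return max_size, base_delay_s     # heavy traffic: larger batches, same wait
    if queue_depth <= 1:
        return base_size, min_delay_s     # light traffic: shorten the wait, avoid idle GPU
    return base_size, base_delay_s        # steady state: keep the defaults
```

The batching loop would call this between flushes and apply the returned size and delay to the next batch.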
💡 Key Takeaways
•Dynamic batching yields 1.5 to 5x GPU throughput improvement by grouping requests, commonly adding 5 to 20 milliseconds of queuing latency with 10ms timeout settings
•Per-model queues prevent head-of-line blocking, where slow models delay fast ones; without isolation, a 200ms NLP model can inflate p95 latency for a 5ms image classifier
•Max batch delay timeout is the critical parameter: set 5 to 15ms for interactive APIs to bound added latency, or 100+ ms for offline scoring to maximize throughput
•TorchServe users report 2 to 3x throughput gains on GPU for computer vision models with dynamic batching enabled, accepting small p50 latency increase from queuing
•Failure mode under spiky traffic is batch-size variance: full batches achieve high efficiency while singleton batches waste GPU cycles; adaptive sizing adjusts batch parameters based on queue depth
📌 Examples
Meta TorchServe serving 30 image classification models per GPU with per-model batching: max batch size 16, timeout 10ms, achieving 150 QPS per model versus 60 QPS without batching
Netflix recommendation model with dynamic batching: queue delay adds 8ms p50, but throughput increases from 40 to 120 requests per second on single GPU, reducing fleet size by 60%
Fraud detection system with mixed model sizes: fast 5ms models in one worker pool with a tight 5ms timeout, slow 200ms models in a separate pool with a 50ms timeout to prevent cross-contamination