Dynamic Batching in Multi-Model Serving
Why Batching Matters
Dynamic batching groups multiple inference requests into a single forward pass to maximize GPU utilization, commonly yielding 1.5x to 5x throughput improvements at the cost of added queuing latency. GPUs reach peak efficiency when processing many inputs at once; naive one-at-a-time inference leaves the hardware idle between requests. In multi-model serving, batching happens per model: requests for the same model wait in a queue until a batch-size threshold or a timeout is reached, then execute together.
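To see where gains in that range come from, a back-of-envelope calculation helps. The latency numbers below are assumed for illustration, not measurements; the point is that a batch of 8 rarely costs 8x a single request, because per-pass overhead is amortized:

```python
# Illustrative (assumed) latencies: kernel launch and memory-bound overhead
# mean a batch of 8 costs far less than 8x a batch of 1.
single_ms = 8.0     # assumed GPU latency for a batch of 1
batch8_ms = 20.0    # assumed GPU latency for a batch of 8

seq_throughput = 1000 / single_ms          # requests/sec, one at a time
batched_throughput = 8 * 1000 / batch8_ms  # requests/sec, batches of 8

print(round(seq_throughput), round(batched_throughput))  # 125 vs 400
```

Under these assumptions, batching yields 3.2x throughput, while each request waits up to one timeout period extra in the queue.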
The Batching Mechanism
The core mechanism is a per-model queue with two triggers: batch size (for example, accumulate 8 requests) and maximum batch delay (for example, wait no more than 10 ms). Whichever trigger fires first causes the batch to execute. For interactive APIs with strict latency budgets, the timeout is the critical parameter: TorchServe users commonly set maximum delays of 5 to 15 ms, accepting a small added latency to gain 2 to 3x throughput on GPU vision models. For offline batch scoring, the timeout can be raised to 100 ms or more to build larger batches.
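The two-trigger loop can be sketched in a few lines of Python. `collect_batch` and its defaults are illustrative, not any framework's API; a real server would run one such loop per model on a worker thread:

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch_size: int = 8,
                  max_delay_s: float = 0.010) -> list:
    """Block for the first request, then accumulate until either the
    batch-size trigger or the max-delay trigger fires (whichever is first)."""
    batch = [requests.get()]                      # block until work arrives
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:            # size trigger
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # delay trigger
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                                 # delay trigger (while waiting)
    return batch                                  # ready for one forward pass
```

With spiky traffic, this loop returns full batches when the queue is deep and partial batches (down to singletons) when the timeout fires first.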
Head-of-Line Blocking
If one slow or large model dominates a shared batch queue, it delays the fast models waiting behind it. Multi-model serving makes this worse because heterogeneous models with vastly different inference times (a 5 ms image classifier versus a 200 ms NLP encoder) can share workers. The symptom is p95 latency inflation for the fast models. The fix is per-model queues with isolated batching, so each model accumulates its batches independently, or partitioning workers by model size class.
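A minimal sketch of per-model queue isolation, with hypothetical model names; in a real server each queue would be drained by its own batching worker, so a slow model's half-built batch never delays a fast model's requests:

```python
import collections
import queue

# One queue per model; created lazily on first request for that model.
model_queues: dict = collections.defaultdict(queue.Queue)

def enqueue(model_name: str, request: dict) -> None:
    """Route each request to its model's private queue (never a shared one)."""
    model_queues[model_name].put(request)

# The fast classifier and the slow encoder accumulate batches independently.
enqueue("image_classifier", {"img_bytes": b"..."})   # hypothetical names
enqueue("text_encoder", {"text": "hello"})
```

Worker partitioning is the coarser variant of the same idea: pin each worker (or worker pool) to one model size class so a 200 ms batch can never occupy the hardware a 5 ms model needs.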
Batch Size Variance
If traffic is spiky, some batches fill to the size limit while others are tiny singletons that fire on the timeout. Full batches deliver high throughput, but singletons waste GPU cycles. Production systems mitigate this with adaptive batch sizing (monitor queue depth and adjust batch parameters dynamically) or with continuous batching, where the system starts inference as soon as one request arrives and opportunistically merges new requests into the running batch.
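A toy adaptive policy illustrates the queue-depth feedback loop; the thresholds and multipliers are assumptions, not any specific system's rule. When the queue is deep, batches already fill via the size trigger, so the delay can shrink to cut waiting; when singletons dominate, a longer delay builds fuller batches:

```python
def adapt_delay(queue_depth: int, cur_delay_ms: float,
                lo_ms: float = 2.0, hi_ms: float = 50.0) -> float:
    """Hypothetical policy: multiplicatively adjust the max batch delay
    based on observed queue depth, clamped to [lo_ms, hi_ms]."""
    if queue_depth >= 8:                      # batches fill on size trigger
        return max(lo_ms, cur_delay_ms / 2)   # stop paying extra latency
    if queue_depth <= 1:                      # mostly singletons
        return min(hi_ms, cur_delay_ms * 2)   # wait longer for batch peers
    return cur_delay_ms                       # in the healthy band: hold
```

A controller would call this periodically per model and feed the result back into the batcher's timeout.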