
Dynamic Batching for Low Latency GPU Inference

Dynamic batching is a serving pattern that groups incoming inference requests in real time to maximize GPU throughput while staying within latency budgets. A batcher sits in front of the GPU worker and holds requests for a small time window, typically 2 to 5 milliseconds, or until a maximum count such as 16 to 64 is reached. This brief wait allows multiple concurrent requests to arrive and form a batch, which is then submitted to the GPU as a single operation.

The key is balancing the wait time against your Service Level Agreement (SLA). A ranking service with a 100 millisecond p99 budget might allocate 70 milliseconds for model inference after reserving time for upstream feature fetching and downstream aggregation. Holding requests for 2 to 5 milliseconds adds minimal latency but can double or triple throughput by keeping the GPU busy with large batches instead of processing tiny batches inefficiently.

Meta and Google production systems use this approach extensively. At 5,000 to 10,000 queries per second (QPS), a 5 millisecond batching window naturally accumulates 25 to 50 requests. The GPU processes this batch in one kernel launch instead of 50 separate launches, cutting overhead dramatically. Device utilization jumps from 30% to 75%, and per-request compute cost drops by 2 to 5 times, all while keeping p99 latency under 100 milliseconds.

Under low traffic, such as overnight hours, batches do not fill completely. The time trigger ensures the batcher flushes whatever has accumulated once the wait period expires, preventing indefinite delays. Adaptive strategies can shrink the window during low traffic and expand it during peak hours to maintain both throughput and latency guarantees.
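A minimal sketch of the pattern, assuming a Python serving process where run_model is whatever callable executes the batched forward pass; the 5 millisecond window and batch cap of 32 are illustrative values from the ranges above, not a specific production configuration:

```python
import queue
import threading
import time

MAX_WAIT_S = 0.005   # flush after 5 ms even if the batch is not full (time trigger)
MAX_BATCH = 32       # flush immediately once 32 requests have accumulated (count trigger)


class DynamicBatcher:
    """Groups incoming requests until MAX_BATCH items arrive or MAX_WAIT_S elapses."""

    def __init__(self, run_model):
        self._run_model = run_model          # callable: list[payload] -> list[result]
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, payload):
        """Called per incoming request; blocks until the batched result is ready."""
        done = threading.Event()
        slot = {"payload": payload, "done": done, "result": None}
        self._requests.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            # Block until the first request arrives, then open the batching window.
            first = self._requests.get()
            batch = [first]
            deadline = time.monotonic() + MAX_WAIT_S

            # Keep pulling requests until the batch is full or the window expires,
            # whichever happens first.
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break

            # One GPU call for the whole batch, then fan results back out.
            results = self._run_model([slot["payload"] for slot in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()
```

In a real server, submit would be called from each request-handler thread; production systems typically replace the per-request Event with futures or an async event loop, but the two flush triggers stay the same.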
💡 Key Takeaways
Hold requests for 2 to 5 milliseconds or until reaching a count like 16 to 64, whichever happens first, then flush to GPU
At 5,000 to 10,000 QPS, a 5 millisecond window naturally accumulates 25 to 50 requests without explicit coordination
Device utilization increases from 30% to 75% while p99 latency stays under 100 milliseconds in Meta and Google production systems
Time based flushing prevents indefinite waiting during low traffic periods, ensuring every request eventually gets processed
Adaptive windows shrink during low load to reduce wasted wait time and expand during high load to maximize batch fill
Monitor batch fill ratio, queue time p99, and device utilization to tune the window size and maximum batch count
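As a rough illustration of the last two points, a hypothetical controller might adjust the window from the observed batch fill ratio; the thresholds and 1 millisecond step here are assumptions, and the 2 to 6 millisecond range mirrors the adaptive example below:

```python
# Hypothetical adaptive-window heuristic: widen the window when batches flush
# full (high traffic) and shrink it when they flush mostly empty (low traffic).
MIN_WAIT_S, MAX_WAIT_S = 0.002, 0.006   # 2-6 ms range
STEP_S = 0.001                          # adjust in 1 ms increments


def adjust_window(current_wait_s, recent_fill_ratios):
    """recent_fill_ratios: batch_size / max_batch for the last N flushed batches."""
    avg_fill = sum(recent_fill_ratios) / len(recent_fill_ratios)
    if avg_fill > 0.9:
        # Batches fill before the timer fires: traffic is high, so a longer
        # window (and possibly a larger batch cap) buys more throughput.
        step = STEP_S
    elif avg_fill < 0.5:
        # Batches flush half-empty: traffic is low, so stop paying wait time.
        step = -STEP_S
    else:
        step = 0.0
    return min(MAX_WAIT_S, max(MIN_WAIT_S, current_wait_s + step))
```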
📌 Examples
Netflix ranking service: 100ms p99 budget allocates 70ms for inference. Dynamic batcher holds requests for 3ms, forms batches of 32, achieves 2.5x throughput gain while staying at 85ms p99.
Uber ETA prediction: 5,000 QPS traffic with 4ms batching window creates batches of 20 to 25 requests. GPU processes batch in 15ms instead of 50 separate 5ms calls, reducing compute cost by 3x.
Google Search ranking: Adaptive batching increases window from 2ms to 6ms during peak traffic, forming batches of 48 to 64 instead of 16, doubling throughput without violating 100ms SLA.
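The batch sizes in these examples follow directly from arrival rate times window length; a quick back-of-the-envelope check using the figures above, with an assumed maximum batch size of 64:

```python
# Expected batch size is roughly arrival rate (QPS) x window length,
# capped by the configured maximum batch size.
def expected_batch_size(qps, window_s, max_batch=64):
    return min(max_batch, round(qps * window_s))

print(expected_batch_size(5_000, 0.004))    # ~20 requests (Uber-style example)
print(expected_batch_size(5_000, 0.005))    # ~25 requests, low end of 25-50 at 5 ms
print(expected_batch_size(10_000, 0.005))   # ~50 requests at 10,000 QPS
```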