Dynamic Batching for Low Latency GPU Inference
How Dynamic Batching Works
A batcher sits between the request queue and the inference engine. It collects incoming requests until either the batch reaches its maximum size or a timeout expires. Two parameters control this: max_batch_size (typically 16-64) and max_delay_ms (typically 5-20 ms). When either condition triggers, the batch is padded to a uniform shape (if needed) and sent to the GPU; results are then routed back to the individual request handlers.
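The collect-until-full-or-timeout loop can be sketched with an asyncio queue. This is a minimal illustration, not a production implementation: it assumes each queued item is an (input, future) pair, and names like run_model, MAX_BATCH_SIZE, and MAX_DELAY_MS are illustrative.

```python
# Sketch of a dynamic batcher. Assumptions: each queue item is an
# (input, future) pair, and run_model takes a list of inputs and
# returns a list of outputs in the same order.
import asyncio

MAX_BATCH_SIZE = 32   # typical range: 16-64
MAX_DELAY_MS = 10     # typical range: 5-20 ms

async def batcher(queue: asyncio.Queue, run_model):
    """Collect requests until the batch is full or the timeout expires."""
    while True:
        # Block for the first request; its arrival starts the timeout window.
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_DELAY_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # timeout expired: dispatch a partial batch
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs, futures = zip(*batch)
        outputs = run_model(list(inputs))        # one GPU call for the batch
        for fut, out in zip(futures, outputs):   # fan results back out
            fut.set_result(out)
```

Each request handler awaits its own future, so callers never see the batching; they submit one input and receive one output.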
Padding and Efficiency
Variable-length inputs (text, sequences) require padding to the longest item in the batch. A batch of texts with lengths [10, 20, 100] pads every item to length 100, so 170 of the 300 token slots (~57%) are padding, i.e., wasted compute. Solutions: bucket by sequence length so each batch contains only similarly sized inputs (e.g., buckets at 32, 64, 128, 256 tokens), or implement packed batching (concatenate sequences with separators into one long sequence, eliminating padding entirely).
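The padding-waste arithmetic and bucket assignment above can be made concrete with a few lines of Python. The bucket boundaries are the example values from the text; the function names are illustrative.

```python
# Sequence bucketing sketch. Bucket boundaries follow the example in the
# text; sequences longer than the last boundary are clamped into it.
import bisect

BUCKETS = [32, 64, 128, 256]

def bucket_for(length: int) -> int:
    """Smallest bucket boundary >= length (overlong sequences use the last bucket)."""
    i = bisect.bisect_left(BUCKETS, length)
    return BUCKETS[min(i, len(BUCKETS) - 1)]

def padding_waste(lengths) -> float:
    """Fraction of compute spent on padding when padding to the batch max."""
    padded = max(lengths) * len(lengths)
    return (padded - sum(lengths)) / padded

# padding_waste([10, 20, 100]) -> 170/300, i.e. ~57% of slots are padding.
# With bucketing, the 10- and 20-token inputs batch together in the
# 32-token bucket, while the 100-token input lands in the 128-token bucket.
```

Packed batching removes this waste entirely but requires attention masks (or position resets) so tokens from different sequences cannot attend to each other.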
Implementation Options
Triton Inference Server: built-in dynamic batching with configurable parameters. TorchServe: requires a custom batching handler. Custom server: an async queue with batch-formation logic. Key metrics: batch fill rate (how full batches are on average) and timeout trigger rate (how often the timeout fires before the batch fills). A low fill rate with frequent timeouts suggests max_delay_ms is too short or traffic is too sparse to benefit from batching.
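For a custom batcher, the two key metrics are simple ratios over dispatched batches. A minimal tracker might look like the following; the class and property names are assumptions, not part of any framework.

```python
# Illustrative metrics tracker for a custom dynamic batcher.
class BatchMetrics:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.batches = 0
        self.total_items = 0
        self.timeout_triggers = 0

    def record(self, batch_size: int, timed_out: bool) -> None:
        """Call once per dispatched batch."""
        self.batches += 1
        self.total_items += batch_size
        self.timeout_triggers += int(timed_out)

    @property
    def fill_rate(self) -> float:
        """Average batch size as a fraction of max_batch_size."""
        return self.total_items / (self.batches * self.max_batch_size)

    @property
    def timeout_rate(self) -> float:
        """Fraction of batches dispatched by timeout rather than by filling."""
        return self.timeout_triggers / self.batches
```

Reading the two numbers together is what matters: a high timeout rate alone is fine under light load, but a high timeout rate combined with a low fill rate means the GPU is running many small, inefficient batches.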