What is Batching and Why Does It Improve Throughput?
Why Batching Improves Throughput
Every operation carries fixed overhead: network round-trips, kernel launches, memory allocation. Suppose processing one item costs 5ms of overhead plus 1ms of compute. Processing 32 items together costs the same 5ms of overhead plus roughly 10ms of compute, because parallel hardware (a GPU, for example) runs the items concurrently rather than serially. Single-item throughput: ~166 items/second. Batched throughput: ~2133 items/second. The fixed overhead amortizes across the batch while compute grows sub-linearly.
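The arithmetic above can be sketched as a small throughput model. The function name and parameters here are illustrative, using the example numbers from the text (5ms overhead; 1ms serial compute per item; ~10ms compute for a batch of 32 on parallel hardware):

```python
def throughput(batch_size: int, overhead_ms: float, compute_ms: float) -> float:
    """Items processed per second when one batch costs overhead + compute."""
    total_ms = overhead_ms + compute_ms
    return batch_size / (total_ms / 1000.0)

# One item per call: 5ms overhead + 1ms compute = 6ms per item.
single = throughput(1, overhead_ms=5.0, compute_ms=1.0)

# 32 items per call: the same 5ms overhead, ~10ms compute for the whole batch.
batched = throughput(32, overhead_ms=5.0, compute_ms=10.0)

print(int(single))   # 166 items/second
print(int(batched))  # 2133 items/second
```

The model makes the amortization visible: overhead is paid once per batch, so per-item cost falls as the batch grows.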
The Latency Trade-off
Batching increases individual request latency. If requests arrive at 100/second and you batch every 32, each request waits up to 320ms for its batch to fill. For real-time applications, this is unacceptable. The solution: bounded-time batching. Wait up to X milliseconds or until the batch is full, whichever comes first. Typical bounds: 5-50ms for interactive applications, 100-500ms for batch processing.
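A bounded-time batcher can be sketched with a thread-safe queue. This is a minimal illustration, not a production implementation; the names collect_batch, max_batch, and max_wait_s are assumptions, not an established API:

```python
import queue
import time

def collect_batch(q: queue.Queue, max_batch: int = 32,
                  max_wait_s: float = 0.02) -> list:
    """Collect up to max_batch items, waiting at most max_wait_s
    after the first item arrives, whichever comes first."""
    batch = [q.get()]  # block until at least one item exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time bound hit: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline passed with no new item
    return batch
```

Note the deadline starts at the first item, not at an empty queue, so an idle server adds no latency; the time bound only caps how long an already-waiting request can be held back.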
Batching Domains
GPU inference: batch inputs to maximize tensor core utilization. Database operations: batch writes to reduce I/O overhead. API calls: batch requests to external services to stay under rate limits. Message queues: batch messages to reduce per-message overhead. Each domain has a different optimal batch size: GPU inference (16-128), database writes (100-1000), API calls (10-100).
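For the size-bounded domains above (database writes, API calls), the common building block is a chunking helper that splits a stream into fixed-size batches. A minimal sketch; the name chunked and the usage comment are illustrative:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items from an iterable."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Example: insert rows 500 at a time instead of one per statement,
# e.g. for rows in chunked(all_rows, 500): cursor.executemany(sql, rows)
```

Because it consumes any iterable lazily, the same helper works for rows, API payloads, or queued messages; only batch_size changes per domain.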