ML Model Optimization: Batch Size & Throughput Tuning (Easy, ⏱️ ~3 min)

What is Batching and Why Does It Improve Throughput?

Definition
Batching groups multiple independent operations into a single execution. Instead of processing requests one at a time, the system collects several requests and processes them together, amortizing fixed overhead across multiple items.

Why Batching Improves Throughput

Every operation carries fixed overhead: network round-trips, kernel launches, memory allocation. Suppose processing one item costs 5ms of overhead plus 1ms of compute: that is 6ms per item, or about 166 items/second. Processing 32 items together still pays the 5ms overhead only once, plus 10ms of compute; compute scales sublinearly because parallel hardware (GPUs) processes many items simultaneously. That is 15ms for 32 items, or about 2133 items/second. The fixed overhead amortizes across the batch.
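The arithmetic above can be captured in a small helper. This is a minimal sketch using the text's example numbers; the function name and parameters are illustrative, not part of any real API:

```python
def items_per_second(batch_size: int, overhead_ms: float, compute_ms: float) -> float:
    """Throughput for one call that pays a fixed overhead plus batch compute time."""
    total_seconds = (overhead_ms + compute_ms) / 1000.0
    return batch_size / total_seconds

# Single item: 5ms overhead + 1ms compute -> ~166 items/sec
single = items_per_second(1, overhead_ms=5.0, compute_ms=1.0)

# Batch of 32: 5ms overhead + 10ms compute -> ~2133 items/sec
batched = items_per_second(32, overhead_ms=5.0, compute_ms=10.0)
```

Note that compute for the batch is 10ms rather than 32ms, reflecting the sublinear scaling from parallel hardware.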

The Latency Trade-off

Batching increases individual request latency. If requests arrive at 100/second and you wait for a full batch of 32, the first request in each batch waits up to 320ms. For real-time applications, this is unacceptable. The solution is bounded-time batching: wait up to X milliseconds or until the batch is full, whichever comes first. Typical bounds: 5-50ms for interactive applications, 100-500ms for batch processing.
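Bounded-time batching can be sketched with Python's standard `queue` module. This is an illustrative implementation, not a specific library's API; the function name and defaults are assumptions:

```python
import queue
import time

def collect_batch(q: queue.Queue, max_batch: int = 32, max_wait_ms: float = 50.0) -> list:
    """Collect up to max_batch items, waiting at most max_wait_ms after the first arrives."""
    batch = [q.get()]  # block for the first item; a batch needs at least one
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time bound hit: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

The loop returns as soon as either condition is met: the batch fills to `max_batch`, or `max_wait_ms` elapses, so the first request's latency is bounded regardless of traffic.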

Batching Domains

GPU inference: batch inputs to maximize tensor core utilization. Database operations: batch writes to reduce I/O overhead. API calls: batch requests to external services to stay under rate limits. Message queues: batch messages to reduce per-message overhead. Each domain has different optimal batch sizes: GPU inference (16-128), database writes (100-1000), API calls (10-100).
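Regardless of domain, the mechanics reduce to splitting a stream of work into fixed-size chunks. A minimal sketch, with the per-domain sizes from the text recorded as hypothetical defaults you would tune empirically:

```python
from itertools import islice

# Illustrative defaults drawn from the ranges above; tune for your workload.
DEFAULT_BATCH_SIZE = {"gpu_inference": 64, "db_write": 500, "api_call": 50}

def batches(items, size):
    """Yield successive chunks of at most `size` items from an iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk
```

Usage: `batches(rows, DEFAULT_BATCH_SIZE["db_write"])` yields lists of up to 500 rows, each of which would be handed to a single bulk write.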

💡 Key Takeaways
- Batching amortizes fixed overhead (5ms) across multiple items, enabling 10x+ throughput gains
- Single-item: 166 items/sec; batch of 32: 2133 items/sec (13x improvement in example)
- Trade-off: individual latency increases; use bounded-time batching (5-50ms for interactive)
- Optimal batch sizes vary by domain: GPU (16-128), database (100-1000), API (10-100)
- Wait up to X ms or until batch full, whichever comes first: balances latency and throughput
📌 Interview Tips
1. Give a specific throughput calculation (166 vs 2133 items/sec) to demonstrate quantitative thinking
2. Mention bounded-time batching as the solution to the latency trade-off
3. List domain-specific batch sizes to show breadth of batching experience