ML Model Optimization: Batch Size & Throughput Tuning (Easy, ⏱️ ~3 min)

What is Batching and Why Does It Improve Throughput?

Batching groups multiple units of work together before a single processing step. Instead of processing one item at a time, you accumulate items and process them together. This pattern appears throughout machine learning systems: grouping training examples before a gradient update, bundling user requests before GPU inference, or collecting messages before a network write.

The throughput gain comes from amortizing fixed costs. Every operation has overhead that stays constant regardless of batch size: GPU kernel launches take microseconds, remote procedure calls (RPCs) have setup costs, disk seeks take milliseconds, and optimizer updates have fixed computation. When you process 10 items together, you pay these costs once instead of 10 times.

Consider a concrete example. Processing one item takes 20 seconds end to end, a throughput of 0.05 items per second. If you batch 10 items and complete them in 30 seconds, you achieve 0.333 items per second, roughly 6.7 times the single-item rate. Real systems achieve even larger gains through parallelism: Google and Meta report 2 to 4 times throughput increases in their GPU serving infrastructure using dynamic batching.

The tradeoff is latency. Each item must wait for the batch to fill before processing starts. A cloud function processing one message at a time responds immediately; when batching 100 messages, earlier messages wait while later ones arrive. This waiting time is the price you pay for higher throughput and lower cost per item.
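To make the arithmetic concrete, here is a minimal sketch of that cost model, assuming each batch pays one fixed overhead plus a per-item cost. The 18.9 s and 1.1 s constants are hypothetical values chosen only so the model reproduces the 20 s and 30 s figures above, not measurements from any real system.

```python
# Minimal sketch of how batching amortizes fixed overhead.
# Cost model and constants are illustrative assumptions, not measurements.

FIXED_OVERHEAD_S = 18.9   # paid once per batch (kernel launch, RPC setup, ...)
PER_ITEM_S = 1.1          # paid for every item in the batch

def batch_time(batch_size: int) -> float:
    """End-to-end time to process one batch under the simple cost model."""
    return FIXED_OVERHEAD_S + PER_ITEM_S * batch_size

def throughput(batch_size: int) -> float:
    """Items completed per second at a given batch size."""
    return batch_size / batch_time(batch_size)

if __name__ == "__main__":
    for b in (1, 10, 100):
        print(f"batch={b:>4}  time={batch_time(b):6.1f}s  "
              f"throughput={throughput(b):.3f} items/s")
    # batch=1 gives ~0.05 items/s and batch=10 gives ~0.33 items/s,
    # matching the 20 s and 30 s figures in the text above.
```

The fixed overhead dominates at batch size 1; as the batch grows, throughput climbs toward the ceiling set by the per-item cost alone.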
💡 Key Takeaways
Batching amortizes fixed overhead costs like GPU kernel launches, RPC setup, disk seeks, and optimizer updates across multiple items
Throughput, measured in items per second, increases dramatically while cost per item decreases, often by 2 to 10 times in production systems
Individual item latency increases because items wait in a queue for the batch to fill before processing begins (the size-or-timeout sketch after this list makes that waiting explicit)
Real production gains at Google and Meta show 2 to 4 times throughput improvement in GPU inference serving with dynamic batching
The technique applies across training (grouping examples), inference (grouping requests), and data pipelines (grouping messages)
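The latency point above is easiest to see in code. Below is a minimal sketch of a size-or-timeout micro-batcher in the spirit of dynamic batching; the class name MicroBatcher and the default limits are invented here for illustration and are not taken from any particular serving framework.

```python
import queue
import threading
import time
from typing import Any, Callable, List


class MicroBatcher:
    """Size-or-timeout batcher: a batch ships when either max_batch items
    have arrived or max_wait_s has passed since the first item, trading a
    bounded amount of per-item latency for fewer, larger processing calls."""

    def __init__(self, process_batch: Callable[[List[Any]], None],
                 max_batch: int = 32, max_wait_s: float = 0.01) -> None:
        self._process_batch = process_batch
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: "queue.Queue[Any]" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item: Any) -> None:
        """Callers enqueue items; the background thread forms the batches."""
        self._queue.put(item)

    def _run(self) -> None:
        while True:
            batch = [self._queue.get()]            # block until something arrives
            deadline = time.monotonic() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                          # timeout: ship a partial batch
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            self._process_batch(batch)             # one call amortizes the fixed cost
```

An item that arrives just after a batch closes waits up to max_wait_s plus the processing time of the batch ahead of it; raising max_batch or max_wait_s raises throughput and that waiting time together.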
📌 Examples
Cloud function case study: Processing 1 million messages took 27.7 hours at 1 message per invocation. Switching to 100 messages per invocation reduced the time to 1.4 hours with 100 times lower cost due to per-invocation billing.
GPU inference: Single-item processing achieves 30% device utilization. Batching 16 to 64 requests increases utilization to 75% and reduces per-request cost by 2 to 5 times (see the batched forward-pass sketch after these examples).
Message queue producer: Batching records into tens of kilobytes before sending reduces broker CPU and network calls by 5 to 10 times compared to per-message transmission.
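As an illustration of the GPU inference example, the sketch below stacks several requests into one forward pass. The choice of PyTorch, the toy model, and the batch of 32 are assumptions made here, not details from the case study.

```python
import torch

# Illustrative only: a stand-in model and synthetic request tensors. The point
# is that one forward pass over a stacked batch replaces many single-item
# passes, so fixed kernel-launch and dispatch overhead is paid once per batch.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).to(device).eval()

# Pretend these arrived as 32 separate inference requests.
requests = [torch.randn(512) for _ in range(32)]

with torch.no_grad():
    # Unbatched: one forward pass (and its launch overhead) per request.
    unbatched = [model(x.unsqueeze(0).to(device)) for x in requests]

    # Batched: stack the requests into a (32, 512) tensor and run one pass.
    batch = torch.stack(requests).to(device)
    batched = model(batch)                      # shape (32, 10)

# Each row of `batched` corresponds to one original request.
assert batched.shape == (32, 10)
```

On a GPU, the batched path keeps the device busy with one large matrix multiply per layer instead of 32 small ones, which is where the utilization and per-request cost improvements in the example come from.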