
How Does Batching Improve Training and Inference Utilization?

Batching groups multiple inputs into a single execution to amortize fixed costs such as kernel launches, memory transfers, and scheduler overhead. In training, batch size determines memory usage, numerical stability, and convergence speed. In inference, batching trades queueing delay for dramatically higher throughput.

During training, a larger batch processes more examples per gradient update, which improves GPU utilization by increasing arithmetic intensity. However, activation memory scales linearly with batch size. Gradient accumulation addresses this by splitting a large effective batch into smaller micro batches, computing gradients for each, accumulating them, and updating parameters once. For example, an effective batch of 2048 can be simulated as 8 micro batches of 256, keeping peak memory under control while preserving the statistical properties of the larger batch.

Inference batching works differently because requests arrive independently. Static batching waits for a fixed time window, such as 10 milliseconds, to collect requests and packs them into one batch. Dynamic batching adds length bucketing, grouping sequences of similar length to avoid head-of-line blocking, where short requests wait for long ones to finish. NVIDIA Triton reports that dynamic batching increases throughput by 2 to 3 times while adding only single-digit milliseconds of queueing delay.

Continuous batching takes this further by maintaining a live decoding schedule: new sequences join the batch as soon as capacity frees up, and completed sequences exit immediately without waiting for others. This maximizes GPU occupancy under variable request patterns. OpenAI adopted continuous batching combined with paged key-value (KV) caches to sustain high utilization at scale. For a 70 billion parameter model on 8 A100 GPUs, prefill of an 8 thousand token prompt takes 1 to 2 seconds, and decode then produces 600 to 1200 tokens per second in aggregate, depending on batch size and context length.

The right batching strategy depends on latency budgets and workload characteristics. Interactive applications with 200 millisecond p50 targets use small batching windows and per-tenant caps. Offline batch scoring maximizes batch size and window length to push throughput, accepting seconds of delay.
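To make the accumulation pattern concrete, here is a minimal PyTorch-style sketch. The tiny linear model, synthetic data, and loader are illustrative stand-ins, not a reference implementation; the constants mirror the 8 micro batches of 256 mentioned above, and the key idea is scaling each micro-batch loss so that one optimizer step matches the average over the full effective batch.

```python
import torch
from torch import nn

# Toy stand-ins so the snippet runs end to end; a real setup would use the
# actual model, optimizer, and data loader.
model = nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

ACCUM_STEPS = 8      # 8 micro batches of 256 -> effective batch of 2048
MICRO_BATCH = 256

def micro_batches(num_steps):
    """Yield synthetic micro batches; a real loader would stream training data."""
    for _ in range(num_steps):
        yield torch.randn(MICRO_BATCH, 128), torch.randn(MICRO_BATCH, 1)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches(ACCUM_STEPS * 4)):
    loss = loss_fn(model(inputs), targets)   # forward pass on one micro batch
    (loss / ACCUM_STEPS).backward()          # scale so accumulated gradients match
                                             # the mean over the full effective batch
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                     # one parameter update per 2048 examples
        optimizer.zero_grad()
```

Peak activation memory corresponds to a single 256-example micro batch, while the optimizer only ever sees gradients accumulated over the full 2048 examples.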
💡 Key Takeaways
Gradient accumulation simulates large effective batches by splitting into micro batches, computing gradients separately, then updating once to control activation memory
Dynamic batching in inference groups requests by length buckets to avoid head-of-line blocking, where short sequences wait for long ones, improving throughput 2 to 3 times (a small bucketing sketch follows this list)
Continuous batching adds and removes sequences dynamically as they arrive and finish, maximizing GPU occupancy under variable request patterns without waiting for batch completion
Training batch size affects convergence and memory; inference batch size trades queueing delay for throughput, with different optimal values for interactive versus offline workloads
NVIDIA Triton shows dynamic batching adds only single-digit millisecond delay while increasing throughput, and OpenAI uses continuous batching with paged KV caches for high utilization
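As a rough illustration of the length-bucketing idea above, the sketch below groups queued requests into the same token bins used in the examples that follow (512, 1024, 2048, 4096). The request tuple format, bucket bounds, and batch-size cap are assumptions made for this example, not a real serving framework's API.

```python
from collections import defaultdict

# Hypothetical bucket bounds matching the bins discussed in this section.
BUCKETS = (512, 1024, 2048, 4096)

def bucket_for(length):
    """Return the smallest bucket that fits the sequence, or None if too long."""
    for bound in BUCKETS:
        if length <= bound:
            return bound
    return None

def group_by_bucket(requests, max_batch_size=8):
    """Group (request_id, token_length) pairs into batches of similar length
    so short prompts are never stuck behind a 4096-token request."""
    queues = defaultdict(list)
    for req_id, length in requests:
        bound = bucket_for(length)
        if bound is not None:
            queues[bound].append(req_id)

    batches = []
    for bound, ids in sorted(queues.items()):
        for i in range(0, len(ids), max_batch_size):
            batches.append((bound, ids[i:i + max_batch_size]))
    return batches

# One long prompt and several short ones land in separate batches.
reqs = [("a", 4000), ("b", 300), ("c", 480), ("d", 900), ("e", 120)]
print(group_by_bucket(reqs))
```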
📌 Examples
Training a 70B model with effective batch 2048 split into 8 micro batches of 256 keeps peak activation memory under 40 GB per GPU instead of 320 GB
Inference serving with a 10 millisecond batching window collects 4 to 8 requests, processes them together, increasing throughput from 150 to 500 tokens per second
Continuous batching on 8 A100 GPUs decoding at 1200 tokens per second aggregate handles variable length requests without idle time between batches; a toy scheduler sketch follows this list
Bucketing prompts into 512, 1024, 2048, 4096 token bins prevents a 4096 token request from blocking eight 512 token requests, cutting p99 latency from 8 seconds to 2 seconds
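The scheduler sketch below is a toy model of continuous batching: each loop iteration is one decode step for every active sequence, finished sequences exit immediately, and queued sequences are admitted as soon as slots free up. The Sequence class, token counts, and batch cap are hypothetical; a production engine would additionally manage paged KV-cache blocks per sequence, as noted above.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sequence:
    req_id: str
    remaining_tokens: int   # decode steps left before the sequence finishes

def continuous_batching(waiting, max_batch=4):
    """Toy continuous-batching loop: one decode step per iteration for the whole
    active batch; finished sequences leave and queued ones join immediately."""
    active = []
    step = 0
    while waiting or active:
        # Admit new sequences as soon as slots free up,
        # without waiting for the current batch to drain.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())

        # One decode step for every active sequence.
        for seq in active:
            seq.remaining_tokens -= 1
        finished = [s.req_id for s in active if s.remaining_tokens == 0]
        active = [s for s in active if s.remaining_tokens > 0]

        step += 1
        if finished:
            print(f"step {step}: finished {finished}, active {len(active)}")

queue = deque([Sequence("a", 3), Sequence("b", 8), Sequence("c", 2),
               Sequence("d", 5), Sequence("e", 1), Sequence("f", 4)])
continuous_batching(queue)
```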