
How Does Batching Improve Training and Inference Utilization?

Batching groups multiple inputs into a single execution to amortize fixed costs such as kernel launches, memory transfers, and scheduler overhead. In training, batch size determines memory usage, numerical stability, and convergence speed. In inference, batching trades queueing delay for dramatically higher throughput.

During training, a larger batch processes more examples per gradient update, which improves GPU utilization by increasing arithmetic intensity. However, activation memory scales linearly with batch size. Gradient accumulation addresses this by splitting a large effective batch into smaller micro batches, computing gradients for each, accumulating them, and updating parameters once. For example, an effective batch of 2048 can be simulated as 8 micro batches of 256, keeping peak memory under control while preserving the statistical properties of the larger batch.

Inference batching works differently because requests arrive independently. Static batching waits for a fixed time window, such as 10 milliseconds, to collect requests and packs them into one batch. Dynamic batching adds length bucketing, grouping sequences of similar length to avoid head-of-line blocking, where short requests wait for long ones to finish. NVIDIA Triton reports that dynamic batching increases throughput by 2 to 3 times while adding only single-digit milliseconds of queueing delay.

Continuous batching takes this further by maintaining a live decoding schedule: new sequences join the batch as soon as capacity frees up, and completed sequences exit immediately without waiting for others. This maximizes GPU occupancy under variable request patterns. OpenAI adopted continuous batching combined with paged key-value (KV) caches to sustain high utilization at scale. For a 70 billion parameter model on 8 A100 GPUs, prefill of an 8 thousand token prompt takes 1 to 2 seconds, and decode then produces 600 to 1200 tokens per second in aggregate, depending on batch size and context length.

The right batching strategy depends on latency budgets and workload characteristics. Interactive applications with 200 millisecond p50 targets use small batching windows and per-tenant caps. Offline batch scoring maximizes batch size and window length to push throughput, accepting seconds of delay.
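To make the accumulation pattern concrete, here is a minimal PyTorch-style sketch. The tiny linear model, synthetic data, and loader are illustrative stand-ins, not a reference implementation; the constants mirror the 8 micro batches of 256 mentioned above, and the key idea is scaling each micro-batch loss so that one optimizer step matches the average over the full effective batch.

```python
import torch
from torch import nn

# Toy stand-ins so the snippet runs end to end; a real setup would use the
# actual model, optimizer, and data loader.
model = nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

ACCUM_STEPS = 8      # 8 micro batches of 256 -> effective batch of 2048
MICRO_BATCH = 256

def micro_batches(num_steps):
    """Yield synthetic micro batches; a real loader would stream training data."""
    for _ in range(num_steps):
        yield torch.randn(MICRO_BATCH, 128), torch.randn(MICRO_BATCH, 1)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches(ACCUM_STEPS * 4)):
    loss = loss_fn(model(inputs), targets)   # forward pass on one micro batch
    (loss / ACCUM_STEPS).backward()          # scale so accumulated gradients match
                                             # the mean over the full effective batch
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                     # one parameter update per 2048 examples
        optimizer.zero_grad()
```

Peak activation memory corresponds to a single 256-example micro batch, while the optimizer only ever sees gradients accumulated over the full 2048 examples.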
💡 Key Takeaways
Gradient accumulation simulates large effective batches by splitting into micro batches, computing gradients separately, then updating once to control activation memory
Dynamic batching in inference groups requests by length buckets to avoid head-of-line blocking, where short sequences wait for long ones, improving throughput 2 to 3 times (a small bucketing sketch follows this list)
Continuous batching adds and removes sequences dynamically as they arrive and finish, maximizing GPU occupancy under variable request patterns without waiting for batch completion
Training batch size affects convergence and memory; inference batch size trades queueing delay for throughput, with different optimal values for interactive versus offline workloads
NVIDIA Triton shows dynamic batching adds only single-digit millisecond delay while increasing throughput, and OpenAI uses continuous batching with paged KV caches for high utilization
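As a rough illustration of the length-bucketing idea above, the sketch below groups queued requests into the same token bins used in the examples that follow (512, 1024, 2048, 4096). The request tuple format, bucket bounds, and batch-size cap are assumptions made for this example, not a real serving framework's API.

```python
from collections import defaultdict

# Hypothetical bucket bounds matching the bins discussed in this section.
BUCKETS = (512, 1024, 2048, 4096)

def bucket_for(length):
    """Return the smallest bucket that fits the sequence, or None if too long."""
    for bound in BUCKETS:
        if length <= bound:
            return bound
    return None

def group_by_bucket(requests, max_batch_size=8):
    """Group (request_id, token_length) pairs into batches of similar length
    so short prompts are never stuck behind a 4096-token request."""
    queues = defaultdict(list)
    for req_id, length in requests:
        bound = bucket_for(length)
        if bound is not None:
            queues[bound].append(req_id)

    batches = []
    for bound, ids in sorted(queues.items()):
        for i in range(0, len(ids), max_batch_size):
            batches.append((bound, ids[i:i + max_batch_size]))
    return batches

# One long prompt and several short ones land in separate batches.
reqs = [("a", 4000), ("b", 300), ("c", 480), ("d", 900), ("e", 120)]
print(group_by_bucket(reqs))
```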
📌 Examples
Training a 70B model with effective batch 2048 split into 8 micro batches of 256 keeps peak activation memory under 40 GB per GPU instead of 320 GB
Inference serving with a 10 millisecond batching window collects 4 to 8 requests, processes them together, increasing throughput from 150 to 500 tokens per second
Continuous batching on 8 A100 GPUs decoding at 1200 tokens per second aggregate handles variable length requests without idle time between batches; a toy scheduler sketch follows this list
Bucketing prompts into 512, 1024, 2048, 4096 token bins prevents a 4096 token request from blocking eight 512 token requests, cutting p99 latency from 8 seconds to 2 seconds
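The scheduler sketch below is a toy model of continuous batching: each loop iteration is one decode step for every active sequence, finished sequences exit immediately, and queued sequences are admitted as soon as slots free up. The Sequence class, token counts, and batch cap are hypothetical; a production engine would additionally manage paged KV-cache blocks per sequence, as noted above.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sequence:
    req_id: str
    remaining_tokens: int   # decode steps left before the sequence finishes

def continuous_batching(waiting, max_batch=4):
    """Toy continuous-batching loop: one decode step per iteration for the whole
    active batch; finished sequences leave and queued ones join immediately."""
    active = []
    step = 0
    while waiting or active:
        # Admit new sequences as soon as slots free up,
        # without waiting for the current batch to drain.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())

        # One decode step for every active sequence.
        for seq in active:
            seq.remaining_tokens -= 1
        finished = [s.req_id for s in active if s.remaining_tokens == 0]
        active = [s for s in active if s.remaining_tokens > 0]

        step += 1
        if finished:
            print(f"step {step}: finished {finished}, active {len(active)}")

queue = deque([Sequence("a", 3), Sequence("b", 8), Sequence("c", 2),
               Sequence("d", 5), Sequence("e", 1), Sequence("f", 4)])
continuous_batching(queue)
```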