How Does Continuous Batching Work in LLM Serving?
THE PROBLEM WITH STATIC BATCHING
Traditional batching waits for a fixed number of requests, processes them together, and returns all results. This works poorly for LLM generation because requests have vastly different output lengths.
A request generating 10 tokens finishes quickly. A request generating 1000 tokens takes 100x longer. In static batching, the short request waits for the long request to complete before returning. This wastes GPU cycles and increases latency for short requests.
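The waste described above is easy to quantify with a toy model (illustrative numbers, not a benchmark): every request holds a batch slot until the longest request finishes, so idle slot-steps pile up.

```python
# Toy model of static-batching waste: a batch runs until its longest
# request completes, so short requests' slots sit idle.

def static_batch_waste(output_lengths):
    """Return the fraction of slot-steps spent idle in one static batch."""
    steps = max(output_lengths)               # batch runs this many steps
    total_slots = steps * len(output_lengths)
    useful = sum(output_lengths)              # slot-steps that emit a token
    return 1 - useful / total_slots

# One 10-token and one 1000-token request: the short request's slot
# idles for 990 of 1000 steps, so nearly half the work is wasted.
print(static_batch_waste([10, 1000]))  # → 0.495
```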
HOW CONTINUOUS BATCHING WORKS
Continuous batching processes requests at the token level, not the request level. At each generation step:
1. Generate one token for all active requests in the batch
2. If a request finishes (hits its EOS token or max length), remove it from the batch
3. If the batch has space, add waiting requests
4. Repeat
Requests enter and exit the batch dynamically. Short requests complete and free resources without waiting for long requests. New requests start immediately when space is available.
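The four steps above can be sketched as a scheduler loop. This is a simplified model: generation is simulated, and a remaining-token counter stands in for real EOS and max-length checks.

```python
from collections import deque

def continuous_batch(requests, max_batch_size):
    """requests: list of (request_id, num_output_tokens).
    Returns request ids in the order they finish."""
    waiting = deque(requests)
    active = {}          # request_id -> tokens still to generate
    finished = []
    while waiting or active:
        # Step 3: fill free batch slots from the waiting queue.
        while waiting and len(active) < max_batch_size:
            rid, n = waiting.popleft()
            active[rid] = n
        # Step 1: generate one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            # Step 2: a finished request frees its slot immediately.
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]
        # Step 4: repeat until no work remains.
    return finished

# Short requests finish and leave the batch while the long one runs on.
print(continuous_batch([("a", 3), ("b", 50), ("c", 3)], max_batch_size=2))
# → ['a', 'c', 'b']
```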
EFFICIENCY GAINS
Continuous batching improves GPU utilization by 2-4x compared to static batching. It also dramatically reduces median latency for short requests, since they no longer wait behind long ones.
The key metric is tokens-per-second throughput. Static batching might achieve 100 tokens/second. Continuous batching on the same hardware achieves 200-400 tokens/second because GPUs stay busy instead of waiting.
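The gap can be seen by counting the generation steps each strategy needs to finish the same workload. This is a hypothetical workload with idealized continuous batching (a lower bound that assumes every slot stays full), not measured numbers.

```python
def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    """Idealized continuous batching: every step emits batch_size tokens
    while work remains, so steps ≈ ceil(total_tokens / batch_size)."""
    total = sum(lengths)
    return -(-total // batch_size)  # ceiling division

# Mixed short and long requests, batch size 2.
lengths = [10, 1000, 20, 900, 15, 800]
print(static_steps(lengths, 2))      # → 2700
print(continuous_steps(lengths, 2))  # → 1373
```

At a fixed per-step rate, roughly half the steps means roughly double the tokens-per-second, consistent with the 2-4x range above.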
IMPLEMENTATION COMPLEXITY
Continuous batching requires sophisticated memory management. Each request has its own KV cache that grows as tokens are generated. The scheduler must track available memory, decide which requests to run, and handle preemption when memory is tight.
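A minimal sketch of that memory accounting, assuming block-based KV-cache allocation in the style of vLLM's paged memory (the block size, block count, and `KVAllocator` class here are illustrative, not any framework's actual API):

```python
BLOCK_TOKENS = 16    # KV-cache tokens stored per memory block (assumed)
TOTAL_BLOCKS = 64    # total blocks available on the device (assumed)

def blocks_for(num_tokens):
    """Blocks needed to hold a KV cache of num_tokens tokens."""
    return -(-num_tokens // BLOCK_TOKENS)  # ceiling division

class KVAllocator:
    def __init__(self, total_blocks=TOTAL_BLOCKS):
        self.free = total_blocks
        self.held = {}   # request_id -> blocks currently allocated

    def try_grow(self, rid, new_len):
        """Reserve blocks so rid's cache can hold new_len tokens.
        Returns False when memory is tight: the scheduler must then
        preempt a request instead of growing this one."""
        need = blocks_for(new_len) - self.held.get(rid, 0)
        if need > self.free:
            return False
        self.free -= need
        self.held[rid] = self.held.get(rid, 0) + need
        return True

    def release(self, rid):
        """Free a finished or preempted request's blocks."""
        self.free += self.held.pop(rid, 0)

alloc = KVAllocator()
print(alloc.try_grow("a", 100))  # → True (7 blocks reserved)
print(alloc.free)                # → 57
```

Each generated token grows a request's cache, so the scheduler calls something like `try_grow` every step and evicts a victim when it fails; real systems add swapping or recomputation on top of this.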
Frameworks like vLLM and TensorRT-LLM implement continuous batching. Building it from scratch is complex; using an established framework is recommended.