
How Does Continuous Batching Work in LLM Serving?

THE PROBLEM WITH STATIC BATCHING

Traditional batching waits for a fixed number of requests, processes them together, and returns all results. This works poorly for LLM generation because requests have vastly different output lengths.

A request generating 10 tokens finishes quickly. A request generating 1000 tokens takes 100x longer. In static batching, the short request waits for the long request to complete before returning. This wastes GPU cycles and increases latency for short requests.
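The waste is easy to quantify. As a rough sketch (the batch size and output lengths below are hypothetical), count how many decode slots the batch occupies versus how many produce useful tokens:

```python
# Illustrative arithmetic: wasted decode slots under static batching.
# Hypothetical batch of 4 requests with these output lengths (in tokens).
output_lengths = [10, 50, 200, 1000]

# Static batching runs every slot until the longest request finishes.
steps = max(output_lengths)            # 1000 decode steps for the batch
useful = sum(output_lengths)           # 1260 slots that produce real tokens
total = steps * len(output_lengths)    # 4000 slots occupied on the GPU
utilization = useful / total

print(f"utilization: {utilization:.1%}")  # 31.5%
```

Under these (made-up) lengths, roughly two-thirds of the batch's decode slots are spent padding finished requests while the 1000-token request runs to completion.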

HOW CONTINUOUS BATCHING WORKS

Continuous batching processes requests at the token level, not the request level. At each generation step:

1. Generate one token for all active requests in the batch

2. If a request finishes (it hits an EOS token or reaches the maximum length), remove it from the batch

3. If the batch has space, add waiting requests

4. Repeat

Requests enter and exit the batch dynamically. Short requests complete and free resources without waiting for long requests. New requests start immediately when space is available.
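The four steps above can be sketched as a toy scheduler loop. Everything here is illustrative: `generate_one_token` stands in for a real model's decode step, and `MAX_BATCH` is an arbitrary capacity; production frameworks like vLLM add memory-aware scheduling on top of this skeleton.

```python
from collections import deque

MAX_BATCH = 4
EOS = "<eos>"

def generate_one_token(request):
    # Stand-in for one decode step of the model for this request.
    request["generated"] += 1
    return EOS if request["generated"] >= request["target_len"] else "tok"

def serve(waiting):
    waiting = deque(waiting)
    active, done = [], []
    while active or waiting:
        # Step 3: admit waiting requests while the batch has free slots.
        while waiting and len(active) < MAX_BATCH:
            active.append(waiting.popleft())
        # Step 1: one decode step for every active request.
        finished = []
        for req in active:
            if generate_one_token(req) == EOS:
                finished.append(req)   # Step 2: mark finished requests
        for req in finished:
            active.remove(req)         # free the slot immediately
            done.append(req)
        # Step 4: repeat until every request is served.
    return done

requests = [{"id": i, "target_len": n, "generated": 0}
            for i, n in enumerate([3, 1, 5, 2, 4])]
print([r["id"] for r in serve(requests)])  # [1, 3, 0, 2, 4]
```

Note the completion order: short requests (ids 1 and 3) finish and leave the batch early, and the fifth request is admitted as soon as a slot frees up, rather than waiting for the whole batch to drain.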

EFFICIENCY GAINS

Continuous batching improves GPU utilization by 2-4x compared to static batching. It also reduces median latency for short requests dramatically since they do not wait for long requests.

The key metric is tokens-per-second throughput. If static batching achieves 100 tokens/second, continuous batching on the same hardware might reach 200-400 tokens/second, because the GPU stays busy instead of idling on finished requests.

IMPLEMENTATION COMPLEXITY

Continuous batching requires sophisticated memory management. Each request has its own KV cache that grows as tokens are generated. The scheduler must track available memory, decide which requests to run, and handle preemption when memory is tight.
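A toy sketch of that bookkeeping: suppose KV-cache memory is carved into fixed-size blocks (as in vLLM's PagedAttention), each request's cache grows by one block every `BLOCK_SIZE` tokens, and the scheduler admits or preempts requests based on free blocks. All names and numbers here are illustrative, not any framework's real API.

```python
BLOCK_SIZE = 16      # tokens per KV-cache block (illustrative)
TOTAL_BLOCKS = 8     # total GPU memory blocks (tiny, for illustration)

def blocks_needed(num_tokens):
    # Ceiling division: a partially filled block still occupies memory.
    return -(-num_tokens // BLOCK_SIZE)

class KVCacheManager:
    def __init__(self, total_blocks=TOTAL_BLOCKS):
        self.free = total_blocks
        self.allocated = {}   # request id -> blocks currently held

    def can_admit(self, prompt_tokens):
        return blocks_needed(prompt_tokens) <= self.free

    def admit(self, req_id, prompt_tokens):
        need = blocks_needed(prompt_tokens)
        assert need <= self.free
        self.free -= need
        self.allocated[req_id] = need

    def grow(self, req_id, num_tokens):
        # Called each decode step; the cache may spill into a new block.
        need = blocks_needed(num_tokens)
        extra = need - self.allocated[req_id]
        if extra > self.free:
            return False          # out of memory: caller must preempt
        self.free -= extra
        self.allocated[req_id] = need
        return True

    def release(self, req_id):
        # Finished (or preempted) requests return their blocks.
        self.free += self.allocated.pop(req_id)

mgr = KVCacheManager()
mgr.admit("a", prompt_tokens=40)        # 40 tokens -> 3 blocks
print(mgr.free)                         # 5
print(mgr.grow("a", num_tokens=49))     # 49 tokens -> 4 blocks: True
print(mgr.free)                         # 4
```

When `grow` returns False, a real scheduler must preempt a request, either swapping its KV cache to CPU memory or dropping it for later recomputation; that policy choice is a large part of the implementation complexity.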

Frameworks like vLLM and TensorRT-LLM implement continuous batching. Building it from scratch is complex; using established frameworks is recommended.

✅ Best Practice: Use continuous batching for any production LLM serving. The efficiency gains are too significant to ignore. vLLM and TensorRT-LLM are proven implementations.
💡 Key Takeaways
Static batching: short requests wait for long requests; wastes GPU cycles, increases latency
Continuous batching: process per-token, requests enter/exit dynamically; 2-4x GPU utilization improvement
Requires sophisticated memory management for per-request KV caches; use vLLM or TensorRT-LLM implementations
📌 Interview Tips
1. Explain the problem with static batching using a concrete example: a 10-token vs. a 1000-token request.
2. Describe the continuous batching loop: generate a token, remove finished requests, admit new ones, repeat.