How Does Continuous Batching Work in LLM Serving?
Static batching waits for every sequence in a batch to finish before admitting new ones, causing head-of-line blocking: fast requests sit idle waiting for slow ones. Continuous batching solves this by changing the batch composition at every step. New requests are admitted as soon as memory allows, and finished sequences are removed immediately. This keeps GPU utilization high and dramatically reduces idle time.
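A toy comparison makes the difference concrete. The sketch below counts how many GPU "slots" sit idle under each policy for requests of mixed output lengths; the numbers and the idealized zero-idle assumption for continuous batching are illustrative, not measurements from any real serving stack.

```python
# Toy illustration (not tied to any real serving framework): count idle
# slot-steps under static vs. continuous batching for mixed output lengths.

def static_batch_idle_steps(output_lens, batch_size):
    """Static batching: a batch runs until its longest sequence finishes."""
    idle = 0
    for i in range(0, len(output_lens), batch_size):
        batch = output_lens[i:i + batch_size]
        longest = max(batch)
        # Every sequence holds its slot for `longest` steps even after it finishes.
        idle += sum(longest - l for l in batch)
    return idle

def continuous_batch_idle_steps(output_lens, batch_size):
    """Continuous batching: a finished sequence's slot is refilled immediately,
    so (ignoring admission limits) no slot waits on another sequence."""
    return 0  # idealized: freed slots are reused on the very next step

lens = [10, 200, 15, 180, 12, 190, 8, 210]
print("static idle slot-steps:    ", static_batch_idle_steps(lens, batch_size=4))
print("continuous idle slot-steps:", continuous_batch_idle_steps(lens, batch_size=4))
```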
LLM serving has two distinct phases with different performance characteristics. Prefill processes the entire input prompt and populates the KV cache; it is compute-bound because it handles many tokens in parallel. Decode generates tokens one at a time and is memory-bandwidth-bound, because each step must fetch large KV tensors from memory while doing little compute per byte. Continuous batching interleaves these heterogeneous workloads, mixing long prefill bursts with short decode steps across many concurrent requests to maximize tokens per second.
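A back-of-envelope arithmetic-intensity calculation shows why the two phases behave differently. The figures below assume a roughly 7B-parameter model in fp16, about 2 FLOPs per parameter per token, and weight traffic dominating memory movement; they are rough assumptions for intuition, not measurements of a specific GPU.

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved) for prefill
# vs. decode, under the assumptions stated above.

params = 7e9
flops_per_token = 2 * params          # ~14 GFLOPs to produce one token
weight_bytes = params * 2             # fp16 weights read once per forward pass

def arithmetic_intensity(tokens_in_pass):
    flops = flops_per_token * tokens_in_pass
    return flops / weight_bytes       # FLOPs per byte of weight traffic

print("decode  (1 token/pass):     %.1f FLOP/byte" % arithmetic_intensity(1))
print("prefill (1024 tokens/pass): %.1f FLOP/byte" % arithmetic_intensity(1024))
# A modern GPU needs on the order of hundreds of FLOPs per byte to stay
# compute-bound, so single-token decode is memory-bandwidth-bound while a
# large prefill pass is compute-bound.
```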
The scheduler maintains two queues: prefill jobs, which can be chunked into windows of at most T tokens, and decode jobs, which add one token per step. Each iteration admits as many decode jobs as the maximum-concurrent-sequence limit allows, then fills the remaining compute budget with prefill chunks, up to a cap on total batched tokens. Chunked prefill breaks very long prompts into smaller units that interleave with decode steps, preventing a single long prompt from monopolizing the kernels and causing inter-token latency spikes for other users.
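A minimal sketch of one scheduler iteration under these rules is shown below. The names and constants (Request, TOKEN_BUDGET, PREFILL_CHUNK, and so on) are hypothetical and not taken from any specific serving framework; they just make the decode-first, then-fill-with-prefill-chunks policy explicit.

```python
# Minimal sketch of a continuous-batching scheduler step with two queues and
# a per-iteration token budget. All names and constants are illustrative.
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 2048        # max tokens processed in one batched iteration
MAX_DECODE_SEQS = 64       # max concurrent sequences in the decode phase
PREFILL_CHUNK = 512        # chunked-prefill window size (tokens)

@dataclass
class Request:
    req_id: int
    prompt_tokens_left: int   # prefill work remaining
    done: bool = False

decode_queue: deque = deque()   # one token per request per step
prefill_queue: deque = deque()  # chunked into <= PREFILL_CHUNK windows

def schedule_one_iteration():
    batch, tokens_used = [], 0
    # 1) Admit decode jobs first: each costs one token of budget.
    for req in list(decode_queue)[:MAX_DECODE_SEQS]:
        if tokens_used + 1 > TOKEN_BUDGET:
            break
        batch.append((req, 1))
        tokens_used += 1
    # 2) Fill the remaining budget with prefill chunks so a long prompt
    #    never monopolizes the step.
    while prefill_queue and tokens_used < TOKEN_BUDGET:
        req = prefill_queue[0]
        chunk = min(PREFILL_CHUNK, req.prompt_tokens_left, TOKEN_BUDGET - tokens_used)
        batch.append((req, chunk))
        tokens_used += chunk
        req.prompt_tokens_left -= chunk
        if req.prompt_tokens_left == 0:
            prefill_queue.popleft()
            decode_queue.append(req)   # prompt fully processed: start decoding
    return batch, tokens_used
```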
Production implementations use paged KV allocation to minimize memory waste. vLLM-style systems limit waste to under 4 percent, compared with naive contiguous allocation, which often wastes 50 to 80 percent when fixed buffers are over-provisioned for variable-length sequences. The scheduler must also enforce admission control that accounts for current KV occupancy and predicted output lengths. Simple policies such as shortest-remaining-prefill-first improve fairness, while earliest-deadline-first targets latency service-level agreements (SLAs). The key failure mode is admitting too many sequences and running out of memory mid-generation, forcing request aborts.
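The sketch below shows block-based KV allocation plus a simple admission check in the spirit of paged attention, but it is not vLLM's actual API; the block size, per-token KV footprint, and pool size are illustrative assumptions.

```python
# Sketch of block-based ("paged") KV allocation with admission control based
# on predicted KV footprint. All sizes are assumptions for illustration.
import math

BLOCK_TOKENS = 16                      # tokens stored per KV block
KV_BYTES_PER_TOKEN = 0.5 * 2**20       # assumed ~0.5 MiB/token for a 7B model in fp16
TOTAL_KV_BYTES = 50 * 2**30            # KV pool reserved on the GPU
TOTAL_BLOCKS = int(TOTAL_KV_BYTES // (BLOCK_TOKENS * KV_BYTES_PER_TOKEN))

free_blocks = TOTAL_BLOCKS

def blocks_needed(prompt_len, predicted_output_len):
    """Blocks this request may need: prompt plus a predicted output length."""
    return math.ceil((prompt_len + predicted_output_len) / BLOCK_TOKENS)

def try_admit(prompt_len, predicted_output_len):
    """Admit only if the KV pool can cover the request's predicted footprint,
    so we do not run out of memory mid-generation and have to abort."""
    global free_blocks
    need = blocks_needed(prompt_len, predicted_output_len)
    if need > free_blocks:
        return False          # keep the request queued instead of risking OOM
    free_blocks -= need
    return True

print(try_admit(prompt_len=4000, predicted_output_len=800))   # True while blocks remain
```

Because requests are charged in small fixed-size blocks rather than a single worst-case contiguous buffer, unused capacity is limited to the tail of the last block per sequence, which is where the under-4-percent waste figure comes from.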
💡 Key Takeaways
•Continuous batching removes finished sequences immediately and admits new requests as soon as memory allows, eliminating the head-of-line blocking of static batching
•Prefill steps are compute-bound and process many tokens in parallel, while decode steps are memory-bandwidth-bound and generate one token per sequence per step
•Chunked prefill breaks long prompts into smaller windows that interleave with decode, keeping inter-token latency low and preventing single requests from monopolizing kernels
•Paged KV allocation, as in vLLM, limits memory waste to under 4 percent, versus 50 to 80 percent for naive contiguous buffers with fixed over-provisioning
•The scheduler enforces admission control based on current KV occupancy and predicted output lengths to prevent out-of-memory failures mid-generation
•Typical policies include shortest-remaining-prefill-first for fairness or earliest-deadline-first for latency SLAs, with online adjustment of chunk sizes based on inter-token latency percentiles
📌 Examples
Single 80 GB GPU serving a 7B model: Reserve 50 GB for KV cache, cap at 64 concurrent sequences with an 800-token limit each, chunk prefill into 512-token windows (memory arithmetic sketched after these examples)
Scheduler iteration: Admit 60 decode jobs (60 sequences × 1 token each), fill remaining budget with 2 prefill chunks of 512 tokens each (1024 total tokens)
vLLM paged attention: Allocate KV in 16-token blocks, sequences share prefix blocks for beam search, memory waste drops from 50% to under 4%
Long prompt handling: 10,000-token prompt chunked into 20 windows of 500 tokens each, interleaved with decode steps from 50 other sequences to keep their inter-token latency below 100 ms
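The memory arithmetic behind the first example can be checked roughly as follows. The model dimensions (32 layers, hidden size 4096, fp16 KV entries) are assumptions for a 7B-class model, not a specific model's configuration.

```python
# Rough check of the 80 GB GPU example, under assumed 7B-class dimensions.
layers, hidden, bytes_per_value = 32, 4096, 2
kv_bytes_per_token = 2 * layers * hidden * bytes_per_value   # K and V per layer
print("KV per token: %.2f MiB" % (kv_bytes_per_token / 2**20))           # ~0.5 MiB

seqs, max_tokens = 64, 800
worst_case = seqs * max_tokens * kv_bytes_per_token
print("Worst-case KV for 64 x 800-token sequences: %.1f GiB" % (worst_case / 2**30))
# ~25 GiB, comfortably within the 50 GB reserved for the KV cache, leaving
# headroom for longer-than-predicted outputs and prefill chunks in flight.
```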