Production Serving Pipeline with Token Streaming
A production text generation system wraps the decoding algorithm inside a streaming pipeline optimized for low perceived latency. The flow starts with tokenization of the user prompt, then a prefill pass where the model processes all prompt tokens in parallel to populate the key value cache for every layer. After prefill, the system enters an iterative decode loop: generate one token, append it to the sequence, update the cache, and immediately stream that token to the client. Streaming begins as soon as the first token is ready, which keeps time to first byte low even if the full response takes seconds.
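The loop below is a minimal sketch of that flow, assuming hypothetical `model.prefill`, `model.decode_step`, and `tokenizer` interfaces rather than any specific framework's API; a real server runs this loop batched across many concurrent requests.

```python
import math
import random

def sample(logits, temperature=1.0):
    # Temperature-scaled sampling over the vocabulary; a production stack
    # would also apply top-p / top-k filtering and repetition penalties here.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    return random.choices(range(len(probs)), weights=[p / total for p in probs])[0]

def generate_stream(model, tokenizer, prompt, max_new_tokens=256):
    # Prefill: all prompt tokens processed in one parallel pass, returning
    # next-token logits plus the populated key value cache for every layer.
    input_ids = tokenizer.encode(prompt)
    logits, kv_cache = model.prefill(input_ids)

    for _ in range(max_new_tokens):
        next_id = sample(logits)
        if next_id == tokenizer.eos_token_id:
            break
        # Stream the token to the client as soon as it is ready, which keeps
        # time to first byte low even if the full response takes seconds.
        yield tokenizer.decode([next_id])
        # Decode step: one token in, cache updated, new logits out.
        logits, kv_cache = model.decode_step(next_id, kv_cache)
```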
For a 7 to 13 billion parameter model on an A100 class GPU, first token latency is typically 150 to 500 milliseconds for short prompts when the system is warm. Decode speed ranges from 20 to 120 tokens per second per active sequence, depending on model size, precision (float16 vs int8), and how many sequences are batched together. For a 70 billion parameter model, decode speed drops to 5 to 20 tokens per second per sequence due to memory bandwidth bottlenecks. Service level objectives often target P95 time to first token under 400 milliseconds and P95 time to last token under 2 to 5 seconds for 200 tokens of output.
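As a quick sanity check on those numbers, the snippet below estimates time to last token from time to first token plus decode throughput; the figures are illustrative values taken from the ranges above, not measurements.

```python
def time_to_last_token(ttft_ms, output_tokens, decode_tps):
    # Total latency = prefill (time to first token) + streaming of the rest.
    return ttft_ms / 1000.0 + output_tokens / decode_tps

# 13B-class model, warm, short prompt: ~200 ms TTFT, ~70 tokens/sec decode.
print(time_to_last_token(200, 200, 70))   # ~3.06 s for 200 output tokens
# 70B-class model at ~15 tokens/sec misses a 5 s last-token SLO for the same output.
print(time_to_last_token(400, 200, 15))   # ~13.7 s
```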
Continuous batching is the key to throughput. Instead of waiting for all requests in a batch to finish, the scheduler merges decode steps from many concurrent users at every iteration. A request that just arrived can join the batch immediately, and a request that finished can leave without blocking others. With sampling, each request contributes one hypothesis, which makes batching efficient. With beam search, each request contributes B hypotheses, fragmenting the batch and reducing utilization. This is why OpenAI, Anthropic, Google, and Meta default to nucleus sampling for chat endpoints and restrict or disallow beam search on public APIs.
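A simplified continuous batching scheduler might look like the sketch below; `Request`, its methods, and `model.decode_batch` are assumed interfaces, not any particular engine's API. The key property is that admission and eviction happen at every decode iteration, not at batch boundaries.

```python
from collections import deque

MAX_BATCH = 40  # limited in practice by KV cache memory, not compute

def scheduler_loop(model, waiting_queue: deque, active: list):
    while active or waiting_queue:
        # Admit new requests the moment a slot is free.
        while waiting_queue and len(active) < MAX_BATCH:
            active.append(waiting_queue.popleft())

        # One decode step for every active sequence in a single batched pass.
        next_tokens = model.decode_batch([r.state for r in active])

        finished = []
        for req, tok in zip(active, next_tokens):
            req.stream(tok)               # push the token to the client
            if req.is_done(tok):          # EOS, stop sequence, or max tokens
                finished.append(req)

        # Evict finished requests without blocking the others.
        active = [r for r in active if r not in finished]
```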
Speculative decoding is often layered in to reduce latency further. A small draft model proposes several tokens, then the large model verifies all proposals in one forward pass. If the draft is accurate, the system accepts multiple tokens per step. Acceptances are common in predictable text regions, yielding 1.5 to 3 times speedup on average. This optimization integrates naturally with sampling but complicates beam search because you must speculate for multiple hypotheses simultaneously.
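A single draft-and-verify step could look like the following sketch. `draft_model.greedy_next` and `target_model.predict_positions` are hypothetical interfaces, and the acceptance rule is shown in its greedy form for clarity; production implementations use a rejection sampling rule that preserves the target model's sampling distribution.

```python
def speculative_step(draft_model, target_model, context, k=4):
    # 1. Small draft model proposes k tokens autoregressively (cheap).
    proposals = []
    draft_ctx = list(context)
    for _ in range(k):
        tok = draft_model.greedy_next(draft_ctx)
        proposals.append(tok)
        draft_ctx.append(tok)

    # 2. Large model verifies all k proposals in a single forward pass,
    #    returning its own preferred token at each of the k positions.
    target_preds = target_model.predict_positions(context, proposals)

    # 3. Accept the longest agreeing prefix; at the first disagreement take
    #    the target's token, so every step yields at least one target token.
    accepted = []
    for drafted, target_tok in zip(proposals, target_preds):
        if drafted == target_tok:
            accepted.append(drafted)
        else:
            accepted.append(target_tok)
            break
    return context + accepted
```

When the draft agrees on all k positions, the step emits k tokens for the price of one large-model forward pass, which is where the 1.5 to 3x average speedup comes from.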
Reliability controls are essential. Set hard limits on max tokens per request, typically 2048 to 4096, to prevent runaway generation if the model never produces an end of sequence token. Enforce stop sequences that mark task completion, such as a closing JSON brace or a specific role token. Apply token level safety masks to block disallowed content, ensuring subword tokens cannot reconstruct banned phrases. Add circuit breakers that switch to greedy decoding or lower temperature when GPU utilization exceeds 85 percent, preserving latency under load spikes. Track time to first token, tokens per second, and P95 tail latencies separately for prefill and decode phases to identify bottlenecks.
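The sketch below collects these controls in one place; the thresholds, stop sequences, and field names are illustrative assumptions, not values from any particular serving stack.

```python
from dataclasses import dataclass, field

@dataclass
class DecodeLimits:
    max_new_tokens: int = 4096                 # hard cap against runaway generation
    stop_sequences: list = field(default_factory=lambda: ["}\n", "<|end|>"])  # example stops
    banned_token_ids: set = field(default_factory=set)  # token level safety mask

def choose_decoding_mode(gpu_utilization: float, default_temperature: float = 0.8):
    # Circuit breaker: fall back to cheaper, more deterministic decoding when
    # the GPU is saturated, to protect tail latency under load spikes.
    if gpu_utilization > 0.85:
        return {"strategy": "greedy", "temperature": 0.0}
    return {"strategy": "nucleus", "temperature": default_temperature, "top_p": 0.95}

def should_stop(generated_text: str, num_new_tokens: int, limits: DecodeLimits) -> bool:
    # Stop on the hard token budget or on any configured stop sequence.
    if num_new_tokens >= limits.max_new_tokens:
        return True
    return any(generated_text.endswith(s) for s in limits.stop_sequences)
```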
💡 Key Takeaways
• First token latency is 150 to 500 milliseconds for 7B to 13B models on A100 GPUs for short prompts, with decode speed of 20 to 120 tokens per second depending on batching and precision
• Continuous batching merges decode steps from concurrent users at every iteration, allowing new requests to join immediately and finished requests to leave without blocking others
• Sampling allows one hypothesis per request and efficient batching, while beam width 4 multiplies cache usage 4x and reduces concurrent users from roughly 60 to 15 on typical GPUs
• Speculative decoding, with a small draft model proposing tokens and the large model verifying them, yields 1.5 to 3x speedup in predictable regions and integrates naturally with sampling
• Production SLOs target P95 time to first token under 400 milliseconds and P95 time to last token under 2 to 5 seconds for 200 token responses in chat endpoints
• Hard limits on max tokens (2048 to 4096) and stop sequences prevent runaway generation when the end of sequence token probability remains low throughout decoding
📌 Examples
OpenAI chat endpoint with 13B model: User sends a 50 token prompt, prefill takes 200 ms, then the response streams 150 tokens at 70 tokens/sec, so total time to last token is 200 ms + 2.1 sec = 2.3 seconds
Continuous batching scenario: GPU serves 40 concurrent users, each at different decode steps. User A finishes at step 50, user B joins immediately, batch size stays at 40 without waiting
Speculative decoding on coding task: Draft 1B model proposes "return None" (2 tokens), large 7B model verifies both in one pass, accepts both, saving one decode step and 20ms