Production Serving Pipeline with Token Streaming
A production text generation system wraps the decoding algorithm inside a streaming pipeline optimized for low perceived latency. The flow starts with tokenization of the user prompt, then a prefill pass where the model processes all prompt tokens in parallel to populate the key value cache for every layer. After prefill, the system enters an iterative decode loop: generate one token, append it to the sequence, update the cache, and immediately stream that token to the client. Streaming begins as soon as the first token is ready, which keeps time to first byte low even if the full response takes seconds.
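The loop below is a minimal sketch of that flow, assuming hypothetical `model.prefill`, `model.decode_step`, and `tokenizer` interfaces rather than any specific framework's API; a real server runs this loop batched across many concurrent requests.

```python
import math
import random

def sample(logits, temperature=1.0):
    # Temperature-scaled sampling over the vocabulary; a production stack
    # would also apply top-p / top-k filtering and repetition penalties here.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    return random.choices(range(len(probs)), weights=[p / total for p in probs])[0]

def generate_stream(model, tokenizer, prompt, max_new_tokens=256):
    # Prefill: all prompt tokens processed in one parallel pass, returning
    # next-token logits plus the populated key value cache for every layer.
    input_ids = tokenizer.encode(prompt)
    logits, kv_cache = model.prefill(input_ids)

    for _ in range(max_new_tokens):
        next_id = sample(logits)
        if next_id == tokenizer.eos_token_id:
            break
        # Stream the token to the client as soon as it is ready, which keeps
        # time to first byte low even if the full response takes seconds.
        yield tokenizer.decode([next_id])
        # Decode step: one token in, cache updated, new logits out.
        logits, kv_cache = model.decode_step(next_id, kv_cache)
```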
For a 7 to 13 billion parameter model on an A100 class GPU, first token latency is typically 150 to 500 milliseconds for short prompts when the system is warm. Decode speed ranges from 20 to 120 tokens per second per active sequence, depending on model size, precision (float16 vs int8), and how many sequences are batched together. For a 70 billion parameter model, decode speed drops to 5 to 20 tokens per second per sequence due to memory bandwidth bottlenecks. Service level objectives often target P95 time to first token under 400 milliseconds and P95 time to last token under 2 to 5 seconds for 200 tokens of output.
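As a quick sanity check on those numbers, the snippet below estimates time to last token from time to first token plus decode throughput; the figures are illustrative values taken from the ranges above, not measurements.

```python
def time_to_last_token(ttft_ms, output_tokens, decode_tps):
    # Total latency = prefill (time to first token) + streaming of the rest.
    return ttft_ms / 1000.0 + output_tokens / decode_tps

# 13B-class model, warm, short prompt: ~200 ms TTFT, ~70 tokens/sec decode.
print(time_to_last_token(200, 200, 70))   # ~3.06 s for 200 output tokens
# 70B-class model at ~15 tokens/sec misses a 5 s last-token SLO for the same output.
print(time_to_last_token(400, 200, 15))   # ~13.7 s
```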
Continuous batching is the key to throughput. Instead of waiting for all requests in a batch to finish, the scheduler merges decode steps from many concurrent users at every iteration. A request that just arrived can join the batch immediately, and a request that finished can leave without blocking others. With sampling, each request contributes one hypothesis, which makes batching efficient. With beam search, each request contributes B hypotheses, fragmenting the batch and reducing utilization. This is why OpenAI, Anthropic, Google, and Meta default to nucleus sampling for chat endpoints and restrict or disallow beam search on public APIs.
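A simplified continuous batching scheduler might look like the sketch below; `Request`, its methods, and `model.decode_batch` are assumed interfaces, not any particular engine's API. The key property is that admission and eviction happen at every decode iteration, not at batch boundaries.

```python
from collections import deque

MAX_BATCH = 40  # limited in practice by KV cache memory, not compute

def scheduler_loop(model, waiting_queue: deque, active: list):
    while active or waiting_queue:
        # Admit new requests the moment a slot is free.
        while waiting_queue and len(active) < MAX_BATCH:
            active.append(waiting_queue.popleft())

        # One decode step for every active sequence in a single batched pass.
        next_tokens = model.decode_batch([r.state for r in active])

        finished = []
        for req, tok in zip(active, next_tokens):
            req.stream(tok)               # push the token to the client
            if req.is_done(tok):          # EOS, stop sequence, or max tokens
                finished.append(req)

        # Evict finished requests without blocking the others.
        active = [r for r in active if r not in finished]
```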
Speculative decoding is often layered in to reduce latency further. A small draft model proposes several tokens, then the large model verifies all proposals in one forward pass. If the draft is accurate, the system accepts multiple tokens per step. Acceptances are common in predictable text regions, yielding 1.5 to 3 times speedup on average. This optimization integrates naturally with sampling but complicates beam search because you must speculate for multiple hypotheses simultaneously.
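A single draft-and-verify step could look like the following sketch. `draft_model.greedy_next` and `target_model.predict_positions` are hypothetical interfaces, and the acceptance rule is shown in its greedy form for clarity; production implementations use a rejection sampling rule that preserves the target model's sampling distribution.

```python
def speculative_step(draft_model, target_model, context, k=4):
    # 1. Small draft model proposes k tokens autoregressively (cheap).
    proposals = []
    draft_ctx = list(context)
    for _ in range(k):
        tok = draft_model.greedy_next(draft_ctx)
        proposals.append(tok)
        draft_ctx.append(tok)

    # 2. Large model verifies all k proposals in a single forward pass,
    #    returning its own preferred token at each of the k positions.
    target_preds = target_model.predict_positions(context, proposals)

    # 3. Accept the longest agreeing prefix; at the first disagreement take
    #    the target's token, so every step yields at least one target token.
    accepted = []
    for drafted, target_tok in zip(proposals, target_preds):
        if drafted == target_tok:
            accepted.append(drafted)
        else:
            accepted.append(target_tok)
            break
    return context + accepted
```

When the draft agrees on all k positions, the step emits k tokens for the price of one large-model forward pass, which is where the 1.5 to 3x average speedup comes from.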
Reliability controls are essential. Set hard limits on max tokens per request, typically 2048 to 4096, to prevent runaway generation if the model never produces an end of sequence token. Enforce stop sequences that mark task completion, such as a closing JSON brace or a specific role token. Apply token level safety masks to block disallowed content, ensuring subword tokens cannot reconstruct banned phrases. Add circuit breakers that switch to greedy decoding or lower temperature when GPU utilization exceeds 85 percent, preserving latency under load spikes. Track time to first token, tokens per second, and P95 tail latencies separately for prefill and decode phases to identify bottlenecks.
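The sketch below collects these controls in one place; the thresholds, stop sequences, and field names are illustrative assumptions, not values from any particular serving stack.

```python
from dataclasses import dataclass, field

@dataclass
class DecodeLimits:
    max_new_tokens: int = 4096                 # hard cap against runaway generation
    stop_sequences: list = field(default_factory=lambda: ["}\n", "<|end|>"])  # example stops
    banned_token_ids: set = field(default_factory=set)  # token level safety mask

def choose_decoding_mode(gpu_utilization: float, default_temperature: float = 0.8):
    # Circuit breaker: fall back to cheaper, more deterministic decoding when
    # the GPU is saturated, to protect tail latency under load spikes.
    if gpu_utilization > 0.85:
        return {"strategy": "greedy", "temperature": 0.0}
    return {"strategy": "nucleus", "temperature": default_temperature, "top_p": 0.95}

def should_stop(generated_text: str, num_new_tokens: int, limits: DecodeLimits) -> bool:
    # Stop on the hard token budget or on any configured stop sequence.
    if num_new_tokens >= limits.max_new_tokens:
        return True
    return any(generated_text.endswith(s) for s in limits.stop_sequences)
```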
💡 Key Takeaways
• First token latency is 150 to 500 milliseconds for 7B to 13B models on A100 GPUs for short prompts, with decode speed of 20 to 120 tokens per second depending on batching and precision
• Continuous batching merges decode steps from concurrent users at every iteration, allowing new requests to join immediately and finished requests to leave without blocking others
• Sampling allows one hypothesis per request and efficient batching, while beam width 4 multiplies cache usage 4x and reduces concurrent users from roughly 60 to 15 on typical GPUs
• Speculative decoding, with a small draft model proposing tokens and the large model verifying them, yields 1.5 to 3x speedup in predictable regions and integrates naturally with sampling
• Production SLOs target P95 time to first token under 400 milliseconds and P95 time to last token under 2 to 5 seconds for 200 token responses in chat endpoints
• Hard limits on max tokens (2048 to 4096) and stop sequences prevent runaway generation when the end of sequence token probability remains low throughout decoding
📌 Examples
OpenAI chat endpoint with 13B model: User sends a 50 token prompt, prefill takes 200 ms, then the response streams 150 tokens at 70 tokens/sec, so total time to last token is 200 ms + 2.1 sec = 2.3 seconds
Continuous batching scenario: GPU serves 40 concurrent users, each at different decode steps. User A finishes at step 50, user B joins immediately, batch size stays at 40 without waiting
Speculative decoding on coding task: Draft 1B model proposes "return None" (2 tokens), large 7B model verifies both in one pass, accepts both, saving one decode step and 20ms