Production Serving Pipeline with Token Streaming
Why Streaming Matters
A 500-token response at 50 ms per token takes 25 seconds to generate (500 × 0.05 s). Without streaming, users stare at a blank screen for the full 25 seconds. With streaming, they see the first token after roughly 50 ms and watch text appear progressively, so perceived latency drops from 25 seconds to under 100 ms.
Streaming is not optional for interactive applications. Users abandon chatbots that take more than 3-5 seconds to show any response. Streaming makes 25-second generations feel instant.
Implementation Architecture
Server-Sent Events (SSE): The standard approach. The server keeps the HTTP connection open (responding with Content-Type: text/event-stream) and pushes each token as a text event. The client receives tokens progressively and appends them to the display. The connection stays open until generation completes or the client disconnects.
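A minimal sketch of the SSE flow in Python. The names `generate_tokens`, `sse_event`, and `stream_response` are illustrative, and `generate_tokens` is a stand-in for a real model's decode loop, not an actual inference API.

```python
from typing import Iterator

def sse_event(data: str) -> str:
    """Format one Server-Sent Events message: a data field plus a blank line."""
    return f"data: {data}\n\n"

def generate_tokens() -> Iterator[str]:
    # Placeholder for the model's token-by-token decode loop.
    yield from ["Hello", ", ", "world", "!"]

def stream_response() -> Iterator[str]:
    # A web framework would send these chunks over the open HTTP
    # connection with Content-Type: text/event-stream.
    for token in generate_tokens():
        yield sse_event(token)
    yield sse_event("[DONE]")  # sentinel so the client knows to close
```

In a real server, `stream_response` would be handed to the framework's streaming-response mechanism; the client-side EventSource API then fires one message event per chunk.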
Token batching: Sending every single token as a network packet is inefficient. Batch 3-5 tokens together before sending. This reduces network overhead while keeping perceived latency low. Users cannot distinguish single-token vs 3-token batches at 50ms intervals.
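The batching described above can be sketched as a small generator. The function name and default batch size are illustrative; the key detail is flushing the trailing partial batch so no tokens are dropped.

```python
from typing import Iterable, Iterator, List

def batch_tokens(tokens: Iterable[str], batch_size: int = 3) -> Iterator[str]:
    """Group consecutive tokens into small chunks before sending.

    Reduces per-token network overhead while keeping perceived
    latency low. Flushes any partial batch at the end.
    """
    buf: List[str] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == batch_size:
            yield "".join(buf)
            buf = []
    if buf:  # flush the trailing partial batch
        yield "".join(buf)
```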
KV Cache Management
Each generation step needs the key-value (KV) pairs from all previous tokens. Without caching, you recompute attention for all prior tokens at every step. With KV cache, you only compute the new token and append to the cache.
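A toy illustration of the append-only cache, assuming a single attention head and NumPy for clarity; real serving stacks keep one cache per layer and head, usually in preallocated GPU buffers rather than growing arrays. The class and function names are hypothetical.

```python
import numpy as np

class KVCache:
    """Append-only key/value cache for one attention head (toy sizes)."""
    def __init__(self, head_dim: int):
        self.keys = np.empty((0, head_dim))    # shape: (seq_len, head_dim)
        self.values = np.empty((0, head_dim))

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Each decode step adds exactly one new key/value row; attention
        # for the new token reads the full cache instead of recomputing
        # K and V for every prior token.
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def attend(query: np.ndarray, cache: KVCache) -> np.ndarray:
    """Scaled dot-product attention of one query over the cached keys/values."""
    scores = cache.keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values
```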
Continuous Batching
Traditional batching waits for an entire batch to complete before starting new requests. Continuous batching inserts new requests into an ongoing batch whenever a request finishes. GPU utilization typically jumps from 30-40% to 80-90%.
The trick: different requests have different lengths. Some finish in 50 tokens, others take 500. Continuous batching fills the freed slot immediately rather than waiting for the longest request.
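The slot-refill behavior can be simulated with a few lines of Python. This is a scheduling sketch, not real GPU code: each request is just a remaining-token count, and a freed slot is refilled from the queue on the very next step.

```python
from collections import deque
from typing import List, Tuple

def continuous_batching(request_lengths: List[int], max_batch: int = 4) -> Tuple[int, int]:
    """Simulate continuous batching over requests of varying lengths.

    Returns (decode_steps, occupied_slot_steps). Freed slots are
    refilled immediately rather than waiting for the longest request.
    """
    queue = deque(request_lengths)
    active: List[int] = []  # remaining tokens per in-flight request
    steps = busy = 0
    while queue or active:
        # Refill freed slots immediately instead of waiting for the batch.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [r - 1 for r in active]  # one decode step for all slots
        busy += len(active)
        steps += 1
        active = [r for r in active if r > 0]  # drop finished requests
    return steps, busy
```

With requests of length [3, 1, 1] and two slots, continuous batching finishes in 3 steps, whereas static batching would run the [3, 1] batch for 3 steps and only then start the leftover request.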