
Greedy vs Beam Search Decoding

At each decoding step, the language model outputs a probability distribution over its entire vocabulary for the next token. Greedy decoding simply picks the highest-probability token (argmax) at every step. It is fast and deterministic, requiring no extra memory beyond a single sequence, but it often gets trapped in locally optimal patterns that lead to repetitive or generic text.

Beam search maintains multiple candidate sequences (beams) simultaneously to explore more of the probability space. At each step it expands every hypothesis with all possible next tokens, scores each continuation, and keeps only the top B sequences, where B is the beam width. For a 7-billion-parameter model, beam width 4 increases key-value (KV) cache memory about 4x per request. With a per-token KV cache of around 0.5 MB and 200 generated tokens, one beam search request consumes roughly 400 MB compared to 100 MB for greedy decoding.

Length normalization is critical for beam search because raw log probabilities favor shorter sequences: every additional token adds a negative log probability, so longer hypotheses accumulate lower scores. A common remedy divides the cumulative log probability by a length penalty such as the GNMT term ((5 + |Y|) / 6)^alpha, with alpha tuned between 0.6 and 1.0. Google's neural machine translation system used beam widths of 4 to 8 with tuned length penalties to improve BLEU scores.

The tradeoff is clear: beam search improves quality on tasks with a single right answer, such as translation, but the memory cost directly reduces how many concurrent users fit on one GPU. Choose greedy decoding for tool calls and structured extraction where speed matters most. Use beam search with width 2 to 8 for translation, summarization, and constrained generation where accuracy dominates and you can afford the throughput hit.
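To make the mechanics concrete, here is a minimal sketch of both decoding loops, assuming a toy next_token_logprobs function in place of a real language model call; the vocabulary, the GNMT-style length_penalty, and the beam width are illustrative choices, not any particular library's API.

```python
import math

# Toy "model": returns log-probabilities over a tiny vocabulary given the
# tokens generated so far. A real system would run the LM forward pass here.
VOCAB = ["the", "cat", "sat", "mat", "on", "<eos>"]

def next_token_logprobs(prefix):
    # Hypothetical fixed distribution for illustration; a real LM conditions on the prefix.
    probs = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
    return {tok: math.log(p) for tok, p in zip(VOCAB, probs)}

def length_penalty(length, alpha=0.8):
    # GNMT-style penalty: ((5 + length) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

def greedy_decode(max_len=10):
    tokens = []
    for _ in range(max_len):
        logprobs = next_token_logprobs(tokens)
        tokens.append(max(logprobs, key=logprobs.get))  # argmax at every step
        if tokens[-1] == "<eos>":
            break
    return tokens

def beam_search(beam_width=4, max_len=10, alpha=0.8):
    beams = [([], 0.0)]   # each hypothesis: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis with every possible next token.
        candidates = [
            (tokens + [tok], score + logp)
            for tokens, score in beams
            for tok, logp in next_token_logprobs(tokens).items()
        ]
        # Keep only the top-B continuations, ranked by length-normalized score.
        candidates.sort(key=lambda c: c[1] / length_penalty(len(c[0]), alpha),
                        reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == "<eos>" else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0]), alpha))

if __name__ == "__main__":
    print("greedy:", greedy_decode())
    print("beam  :", beam_search())
```

With the fixed toy distribution both decoders simply repeat the most likely token, which is exactly the degenerate behavior the JSON-extraction example below describes; the point of the sketch is the bookkeeping. Greedy carries one hypothesis, while beam search re-scores B × |V| candidates per step and keeps B of them, which is why its KV cache grows with the beam width.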
💡 Key Takeaways
Greedy decoding picks the argmax token at each step and needs about 100 MB of KV cache for 200 tokens, while beam width 4 uses roughly 400 MB for the same output length on a 7B-parameter model
Beam search requires length normalization with alpha 0.6 to 1.0 because raw log probability favors shorter sequences and can cause premature termination
Memory cost scales linearly with beam width: each beam stores a full KV cache at roughly 0.5 MB per token for typical 7B models, directly reducing concurrent batch size (see the arithmetic sketch after this list)
Google neural machine translation improved BLEU scores using beam width 4 to 8 with tuned length penalties, demonstrating quality gains for single answer tasks
Production chat endpoints from OpenAI and Anthropic avoid beam search because it reduces throughput by 4x to 8x, preferring sampling methods that keep one hypothesis per user
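The memory and concurrency figures above are back-of-the-envelope arithmetic. A minimal sketch, assuming the text's 0.5 MB of KV cache per token per sequence for a 7B model; the 8 GB KV-cache budget is an illustrative assumption, not a measured number:

```python
# Back-of-the-envelope KV-cache memory per request on a 7B model.
KV_MB_PER_TOKEN = 0.5        # per sequence per token (figure from the text)
PROMPT_TOKENS = 50
GEN_TOKENS = 200
CACHE_BUDGET_MB = 8 * 1024   # assumed KV-cache budget on the GPU

def request_memory_mb(beam_width: int) -> float:
    prompt = PROMPT_TOKENS * KV_MB_PER_TOKEN                  # prompt cache
    generation = GEN_TOKENS * KV_MB_PER_TOKEN * beam_width    # one cache per beam
    return prompt + generation

for width in (1, 4):
    mb = request_memory_mb(width)
    print(f"beam width {width}: {mb:.0f} MB/request, "
          f"~{int(CACHE_BUDGET_MB // mb)} concurrent requests")
```

Whatever the exact budget, the ratio is the point: a width-4 request holds roughly four times the generation cache of a greedy request, so the concurrent batch size drops by about the same factor.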
📌 Examples
Translation task with beam width 4: Input prompt uses 50 tokens of KV cache (25 MB), then generates 200 tokens across 4 beams consuming 400 MB total, limiting GPU to 15 concurrent requests instead of 60 with greedy
Greedy decoding for JSON extraction: the model falls into a repetitive loop such as {"name": "the the the..."} because argmax at each step locally maximizes probability without any global planning
Beam search with length penalty alpha=0.8 for summarization: Without penalty, beam collapses to 10 token output scoring higher than proper 50 token summary due to fewer probability multiplications
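The length-penalty effect in the last example is easy to check numerically. A small sketch, assuming made-up average per-token probabilities (0.5 for the collapsed 10-token output, 0.7 for the 50-token summary) and the GNMT-style penalty with alpha = 0.8:

```python
import math

def length_penalty(length: int, alpha: float = 0.8) -> float:
    # GNMT-style penalty: ((5 + length) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

short = 10 * math.log(0.5)   # 10-token collapsed output, avg token prob 0.5
long_ = 50 * math.log(0.7)   # 50-token proper summary, avg token prob 0.7

print("raw scores:        short", round(short, 2), " long", round(long_, 2))
print("length-normalized: short", round(short / length_penalty(10), 2),
      " long", round(long_ / length_penalty(50), 2))
```

Raw log probability prefers the 10-token output (about -6.9 vs -17.8); dividing by the length penalty flips the ranking (about -3.3 vs -3.0), so the proper summary wins.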