
Greedy vs Beam Search Decoding

Definition
Decoding is how a language model converts its internal probability predictions into actual text. At each step, the model outputs a probability for every token in its vocabulary (typically 50,000+ entries). Decoding is the strategy for picking which token to actually emit next.

Greedy Decoding

The simplest approach: always pick the highest probability token. If the model says "the" has 0.3 probability, "a" has 0.2, and "an" has 0.1, greedy picks "the." Repeat for every position until you hit an end token or max length.
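The loop above is a few lines of code. A minimal sketch, assuming a toy `step_probs` interface (sequence so far → dict of next-token probabilities) standing in for a real model; the probabilities are the ones from the example:

```python
def greedy_decode(step_probs, max_len=10, eos="<eos>"):
    """Greedy decoding: at every position, take the single most likely token."""
    seq = []
    for _ in range(max_len):
        probs = step_probs(seq)           # token -> probability for the next position
        tok = max(probs, key=probs.get)   # argmax: the greedy choice
        if tok == eos:
            break
        seq.append(tok)
    return seq

# Toy stand-in for a language model, wired to the probabilities in the text.
def toy_model(seq):
    if not seq:
        return {"the": 0.3, "a": 0.2, "an": 0.1}
    return {"<eos>": 0.9, "idea": 0.1}

print(greedy_decode(toy_model))  # -> ['the']
```

Note that the decision at each step is final: once "the" is chosen, every alternative continuation of "a" or "an" is gone.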

Why it fails: Greedy gets stuck in local optima. The highest probability first word might lead to a dead end. "The" might score 0.3 now, but the sentence "A brilliant idea emerged" could have higher overall probability than "The idea was good." Greedy cannot see ahead.

Beam Search

Instead of keeping one candidate, keep the top K candidates (K is called the beam width). At each step, expand all K candidates by considering their top tokens, then prune back to K total. A beam width of 5 means tracking 5 partial sequences simultaneously.

How it works: Start with "The" (0.3), "A" (0.2), "An" (0.1) as your 3 beams. Expand each and score by cumulative probability: "The idea" (0.3 × 0.5 = 0.15), "The cat" (0.12), "A brilliant" (0.18), etc. Keep the top 3 overall and repeat. When generation finishes, pick the sequence with the highest cumulative probability; in practice, implementations sum log probabilities to avoid floating-point underflow.
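The expand-then-prune loop can be sketched directly. This is an illustrative implementation, not production code: the `step_probs` interface and the toy model `toy_lm` are assumptions wired to the numbers in the example above, and scores are summed log probabilities:

```python
import heapq
import math

def beam_search(step_probs, beam_width=3, max_len=10, eos="<eos>"):
    """Track the top-K partial sequences, scored by summed log probability."""
    beams = [(0.0, [])]                      # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:       # finished beams carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        # prune: keep only the K highest-scoring candidates
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq and seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])    # best (score, sequence) pair

# Toy stand-in for a model, wired to the numbers in the example above.
def toy_lm(seq):
    table = {
        (): {"The": 0.3, "A": 0.2, "An": 0.1},
        ("The",): {"idea": 0.5, "cat": 0.4},
        ("A",): {"brilliant": 0.9},
        ("An",): {"apple": 0.5},
    }
    return table.get(tuple(seq), {"<eos>": 1.0})

score, seq = beam_search(toy_lm)
print(seq)  # -> ['A', 'brilliant', '<eos>']
```

Note the outcome: greedy would commit to "The" (0.3) at step one, but beam search finds "A brilliant" (cumulative 0.18 vs 0.15), exactly the local-optimum escape described above.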

Beam Search Trade-offs

⚠️ Compute Cost: Beam width K means K× more computation per step. Beam 5 is 5× slower than greedy. For 100-token generation at 50ms per token, that is 5 seconds vs 25 seconds.
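The latency arithmetic above works out as follows (the 50ms-per-token figure is the illustrative number from the text, and this assumes no batching of the K beams):

```python
tokens = 100        # generation length
ms_per_token = 50   # per-token forward-pass latency (figure from the text)
beam_width = 5

greedy_seconds = tokens * ms_per_token / 1000             # one sequence
beam_seconds = tokens * ms_per_token * beam_width / 1000  # K sequences per step
print(greedy_seconds, beam_seconds)  # -> 5.0 25.0
```

In practice the K beams are usually batched through the model together, so real wall-clock overhead is often less than a full K×, but memory and FLOPs still scale with K.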

Quality vs cost: Beam 3-5 typically improves BLEU scores by 1-3 points on translation tasks. Beyond beam 10, gains diminish while costs keep rising. For creative text, beam search often produces repetitive, generic outputs because it optimizes for probability, not diversity.

💡 Key Takeaways
Decoding converts model probability outputs into actual text tokens, one token at a time
Greedy decoding always picks highest probability token but gets stuck in local optima
Beam search tracks K candidates simultaneously, finding better overall sequences
Beam width K means K× compute cost: beam 5 is 5× slower than greedy
Beam search improves translation quality but produces repetitive text for creative tasks
📌 Interview Tips
1. Explain greedy's local optima problem: the highest-probability first word may lead to a worse overall sentence
2. Show beam search mechanics: expand K candidates, prune back to K, pick the best final sequence
3. Mention that beam 3-5 improves BLEU by 1-3 points but gains diminish beyond beam 10