Greedy vs Beam Search Decoding
Greedy Decoding
The simplest approach: at every step, pick the single highest-probability token. If the model assigns "the" a probability of 0.3, "a" 0.2, and "an" 0.1, greedy picks "the." Repeat for every position until you hit an end-of-sequence token or the maximum length.
Why it fails: Greedy gets stuck in local optima because it maximizes each step in isolation rather than the probability of the whole sequence. The highest-probability first word can lead to a dead end: "The" might score 0.3 now, but "A brilliant idea emerged" could have higher overall probability than "The idea was good." Greedy cannot see ahead.
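The failure mode above can be sketched with a hand-built toy model (TOY_LM below is a hypothetical table of next-token probabilities, not a real language model; its numbers mirror the running example). Greedy follows the locally best token at each step and never revisits the choice:

```python
import math

# Hypothetical toy "language model": prefix -> next-token distribution.
# The numbers are illustrative, chosen to match the running example.
TOY_LM = {
    (): {"The": 0.3, "A": 0.2, "An": 0.1},
    ("The",): {"idea": 0.5, "cat": 0.4},
    ("The", "idea"): {"was": 0.5},
    ("The", "idea", "was"): {"good": 0.4},
    ("The", "idea", "was", "good"): {"<eos>": 1.0},
    ("A",): {"brilliant": 0.9},
    ("A", "brilliant"): {"idea": 0.9},
    ("A", "brilliant", "idea"): {"emerged": 0.9},
    ("A", "brilliant", "idea", "emerged"): {"<eos>": 1.0},
}

def greedy_decode(lm, max_len=10):
    """Always take the single highest-probability next token."""
    seq, log_prob = (), 0.0
    for _ in range(max_len):
        dist = lm.get(seq)
        if not dist:
            break
        token, p = max(dist.items(), key=lambda kv: kv[1])
        if token == "<eos>":
            break
        seq += (token,)
        log_prob += math.log(p)
    return seq, log_prob

seq, lp = greedy_decode(TOY_LM)
print(" ".join(seq), math.exp(lp))  # "The idea was good", probability ~0.03
```

Greedy commits to "The" (0.3) and ends with total probability 0.3 × 0.5 × 0.5 × 0.4 = 0.03, while "A brilliant idea emerged" (0.2 × 0.9 × 0.9 × 0.9 ≈ 0.146) is never explored because its first token scores lower.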
Beam Search
Instead of keeping one candidate, keep the top K candidates, where K is the beam width. At each step, expand every candidate with its top next tokens, score each extension by cumulative probability, then prune the pool back to the K best. A beam width of 5 means tracking 5 partial sequences simultaneously.
How it works: Start with "The" (0.3), "A" (0.2), "An" (0.1) as your 3 beams. Expand each, multiplying probabilities along the sequence: "The idea" (0.3 × 0.5 = 0.15), "The cat" (0.3 × 0.4 = 0.12), "A brilliant" (0.2 × 0.9 = 0.18), and so on. Keep the top 3 overall. When all beams finish (or after, say, 20 tokens), pick the sequence with the highest cumulative probability. In practice, implementations sum log-probabilities rather than multiplying raw probabilities, to avoid floating-point underflow on long sequences.
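The procedure above can be sketched as follows. TOY_LM is a hypothetical hand-built table of next-token probabilities matching the worked numbers (not a real model); each beam carries its tokens, a cumulative log-probability, and a finished flag, and the pool is pruned back to the K best after every expansion:

```python
import math

# Hypothetical toy "language model": prefix -> next-token distribution,
# with the same illustrative numbers as the worked example.
TOY_LM = {
    (): {"The": 0.3, "A": 0.2, "An": 0.1},
    ("The",): {"idea": 0.5, "cat": 0.4},
    ("The", "idea"): {"was": 0.5},
    ("The", "idea", "was"): {"good": 0.4},
    ("The", "idea", "was", "good"): {"<eos>": 1.0},
    ("A",): {"brilliant": 0.9},
    ("A", "brilliant"): {"idea": 0.9},
    ("A", "brilliant", "idea"): {"emerged": 0.9},
    ("A", "brilliant", "idea", "emerged"): {"<eos>": 1.0},
}

def beam_search(lm, beam_width=3, max_len=20):
    """Track the top-K sequences by cumulative log-probability."""
    beams = [((), 0.0, False)]  # (tokens, log-prob, finished)
    for _ in range(max_len):
        candidates = []
        for seq, lp, done in beams:
            dist = lm.get(seq)
            if done or not dist:
                # Keep finished (or dead-ended) beams in the pool so they
                # still compete on score.
                candidates.append((seq, lp, True))
                continue
            for token, p in dist.items():
                finished = token == "<eos>"
                new_seq = seq if finished else seq + (token,)
                candidates.append((new_seq, lp + math.log(p), finished))
        # Prune the expanded pool back to the K best partial sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(done for *_, done in beams):
            break
    best_seq, best_lp, _ = beams[0]
    return best_seq, best_lp
```

With beam_width=3 this recovers "A brilliant idea emerged" (probability 0.2 × 0.9 × 0.9 × 0.9 = 0.1458), which greedy decoding misses because its first token scores below "The."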
Beam Search Trade-offs
Quality vs cost: Beam widths of 3-5 typically improve BLEU by 1-3 points on translation tasks. Beyond a width of about 10, gains diminish while costs keep rising, since every step must score K times as many candidates. For creative text, beam search often produces repetitive, generic output because it optimizes for probability, not diversity.