Decoding Failure Modes and Safety Controls
Repetition loops are a common failure mode of deterministic decoding. Greedy decoding and wide-beam search can get stuck generating the same phrase repeatedly because each local step maximizes probability without global planning: the model might produce "the the the the" or cycle through "I think that I think that" indefinitely. Mitigations include repetition penalties, which reduce the logits of tokens already in the sequence, and no-repeat n-gram constraints, which block any n-gram from appearing twice. These heuristics break exact probability semantics but are essential in practice.
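A minimal sketch of both heuristics applied to a single step's logits; the penalty value and the 1D logits tensor are illustrative, not tied to any particular library's API:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens already in the sequence
    (CTRL-style: divide positive logits, multiply negative ones)."""
    for token_id in set(generated_ids):
        score = logits[token_id]
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits

def block_repeated_ngrams(logits, generated_ids, n=3):
    """Forbid any token that would complete an n-gram already seen."""
    if len(generated_ids) < n - 1:
        return logits
    # Map every previously seen (n-1)-gram to the tokens that followed it.
    seen = {}
    for i in range(len(generated_ids) - n + 1):
        key = tuple(generated_ids[i:i + n - 1])
        seen.setdefault(key, set()).add(generated_ids[i + n - 1])
    # Ban continuations of the current (n-1)-gram prefix.
    prefix = tuple(generated_ids[-(n - 1):])
    for banned in seen.get(prefix, ()):
        logits[banned] = float("-inf")
    return logits
```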
Length bias in beam search causes premature termination or verbose padding. Without length normalization, shorter sequences accumulate less negative log probability and win the beam: a translation might stop at 5 tokens when 30 are needed for adequacy. Applying a length penalty with alpha between 0.6 and 1.0 encourages longer outputs, but over-correcting with alpha above 1.2 can produce padding, where the model generates filler words to maximize the length bonus. Task-specific tuning is necessary.
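One common form of this normalization is the GNMT-style length penalty; a sketch of how beam hypotheses would be rescored with it (alpha is the tunable exponent discussed above):

```python
def length_penalty(length, alpha=0.7):
    """GNMT-style penalty: ((5 + length) / 6) ** alpha.
    Grows with length, so dividing by it rewards longer hypotheses."""
    return ((5 + length) / 6) ** alpha

def normalized_score(sum_log_prob, length, alpha=0.7):
    """Rank beam hypotheses by total log prob divided by the penalty,
    so adequate long hypotheses can overtake short truncated ones."""
    return sum_log_prob / length_penalty(length, alpha)
```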
End-of-sequence (EOS) tokens may never be generated under aggressive sampling or after penalties. If top-p is set very low and the true EOS probability falls outside the nucleus, the model will never terminate naturally. Systems must enforce a maximum token limit, typically 2048 to 4096, and use stop sequences such as role markers or format delimiters. Without these hard stops, requests can run until timeout, wasting GPU cycles and blocking other users. Meta's Llama and OpenAI's GPT models both enforce max tokens server-side to prevent this failure.
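A sketch of a decode loop with both hard stops; `sample_next_token` and `decode` are caller-supplied placeholders, and the stop strings are examples rather than any provider's actual delimiters:

```python
def generate(prompt_ids, sample_next_token, decode, eos_token_id,
             max_new_tokens=2048, stop_sequences=("</s>", "\nUser:")):
    """Generate until EOS, a stop sequence, or the hard max-token cap.
    `sample_next_token(ids)` returns the next token ID;
    `decode(ids)` turns generated IDs back into text."""
    output_ids = list(prompt_ids)
    new_ids = []
    for _ in range(max_new_tokens):          # hard server-side cap
        next_id = sample_next_token(output_ids)
        output_ids.append(next_id)
        new_ids.append(next_id)
        if next_id == eos_token_id:          # natural termination
            break
        text = decode(new_ids)
        if any(text.endswith(s) for s in stop_sequences):
            break                            # format-level stop
    return output_ids
```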
Token-level safety controls must align with the tokenizer. Blocking plain-text words is insufficient because subword tokens can reconstruct banned phrases: blocking "violence" does not prevent the tokens "vio" and "lence" from being sampled sequentially. Providers maintain token-level blocklists and apply logit masking at every decode step. Anthropic and OpenAI also run post-generation classifiers to catch unsafe content that slipped through, then refuse to return it.
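A sketch of step-wise logit masking against a token-ID blocklist, assuming a Hugging Face-style tokenizer with an `encode` method; the phrase variants and helper names are illustrative:

```python
import torch

def build_blocklist(tokenizer, banned_phrases):
    """Expand each banned phrase into the token IDs of several surface
    variants (bare, leading-space, capitalized), since subword tokenizers
    assign different IDs to each form."""
    banned_ids = set()
    for phrase in banned_phrases:
        for variant in (phrase, " " + phrase, phrase.capitalize()):
            banned_ids.update(tokenizer.encode(variant, add_special_tokens=False))
    return banned_ids

def mask_blocked_tokens(logits, banned_ids):
    """Applied at every decode step, before softmax/sampling."""
    logits[list(banned_ids)] = float("-inf")
    return logits
```

Note that expanding a phrase into its subword pieces inevitably over-blocks (a piece like "vio" also appears in harmless words), which is one reason providers pair decode-time masks with post-generation classifiers rather than relying on either alone.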
Constrained generation of structured outputs like JSON or SQL can fail under unconstrained sampling. The model might generate a valid opening brace and then sample tokens that violate the schema, producing unmatched braces or missing required fields. Enforce constraints with a finite-state machine or grammar that masks invalid tokens at each step; libraries like Outlines and Guidance implement this by intersecting the vocabulary with the grammar's allowed next tokens. Post-hoc repair with regexes or parsers is brittle and fails on complex schemas.
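A toy sketch of the masking step, assuming a precomputed table mapping each FSM state to its grammatically valid token IDs (conceptually what such libraries build from a grammar, not their actual API):

```python
import torch

def constrained_step(logits, fsm_state, allowed_tokens_for_state):
    """Mask every token the grammar forbids from the current FSM state,
    then sample only from the remaining set.
    Assumes the grammar always allows at least one token per state."""
    allowed = allowed_tokens_for_state[fsm_state]      # set of token IDs
    mask = torch.full_like(logits, float("-inf"))
    mask[list(allowed)] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).item()
    # The caller then advances the FSM by consuming next_token.
    return next_token
```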
Resource starvation occurs when a few requests monopolize GPU memory. Allowing beam width 10 or max tokens 8192 on a shared GPU can exhaust the KV cache and evict other batches, spiking P95 latency from 2 seconds to 30 seconds. Multi-tenant schedulers must cap beam width, max tokens, and memory per request. Some providers implement tiered rate limits where self-serve users get beam width 1 and max tokens 2048, while enterprise customers can request higher limits with reserved capacity.
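A minimal sketch of per-tier request validation in front of the scheduler; the self-serve numbers mirror the example above, while the enterprise values and tier names are illustrative assumptions, not any provider's actual limits:

```python
from dataclasses import dataclass

@dataclass
class TierLimits:
    max_beam_width: int
    max_new_tokens: int

TIER_LIMITS = {
    "self_serve": TierLimits(max_beam_width=1, max_new_tokens=2048),
    "enterprise": TierLimits(max_beam_width=4, max_new_tokens=8192),  # illustrative
}

def validate_request(tier, beam_width, max_new_tokens):
    """Reject requests that exceed the tier's caps before they ever
    reach the GPU scheduler and its KV-cache budget."""
    limits = TIER_LIMITS[tier]
    if beam_width > limits.max_beam_width or max_new_tokens > limits.max_new_tokens:
        raise ValueError(f"request exceeds {tier} limits")
```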
💡 Key Takeaways
• Repetition loops occur in greedy and beam search when the local argmax at each step creates cycles like "the the the". Repetition penalties reduce logits for already-seen tokens, breaking exact probability semantics but preventing degeneration
• Length bias without normalization favors short sequences: a translation might stop at 5 tokens when 30 are needed. A length-penalty alpha of 0.6 to 1.0 encourages adequacy, but alpha above 1.2 creates verbose padding
• The EOS token may never make it into the top-p nucleus under aggressive sampling. Hard max-token limits (2048 to 4096) and stop sequences are mandatory to prevent runaway generation and GPU waste
• Token-level safety masks must block subword tokens, not just plain-text words. Blocking "violence" requires blocking tokens like "vio" and "lence" separately to prevent reconstruction during sampling
• Constrained generation for JSON or SQL requires a finite-state grammar at decode time to mask invalid tokens. Post-hoc regex repair fails on complex schemas with nested structures
• Beam width 10 or max tokens 8192 can exhaust the KV cache on shared GPUs, evicting other batches and spiking P95 latency from 2 seconds to 30 seconds. Schedulers must cap parameters per tier
📌 Examples
Repetition penalty scenario: greedy decoding starts cycling through "I think that I think that"; a repetition penalty of 1.2 divides the positive logits for "I" and "think" by 1.2 (roughly a 17 percent reduction), lowering their scores enough that the model samples "however" instead
Length bias in translation: without a length penalty, beam search outputs "Le chat" (2 tokens, log prob -1.5) instead of the correct "Le chat est sur le tapis" (6 tokens, log prob -4.2) because the shorter hypothesis's total score is less negative; with simple per-token normalization, -1.5 / 2 = -0.75 versus -4.2 / 6 = -0.70, so the longer hypothesis now wins
EOS failure with top-p 0.85: the true EOS probability is 0.10, but higher-ranked tokens already cover the 0.85 cumulative mass, so EOS is truncated from the nucleus and never sampled; the model generates 4096 tokens until the hard limit, wasting 80 seconds of GPU time
Safety mask bypass: the blocklist masks token 1234 ("kill"), but the model samples tokens 12 ("ki") and 34 ("ll") sequentially (illustrative IDs), reconstructing the banned word without ever triggering the mask