Natural Language Processing Systems › Text Generation (Beam Search, Sampling, Decoding) · Hard · ⏱️ ~3 min

Decoding Failure Modes and Safety Controls

Repetition Loops

The model generates "I think that I think that I think that..." endlessly. This happens because once a phrase appears, the model assigns it higher probability at subsequent positions, so the loop reinforces itself. Beam search amplifies this: the repetitive sequence accumulates high probability and dominates all beams.

Fix: Repetition penalty. Reduce the probability of tokens that have already appeared in the output. With a penalty of 1.2, previously seen tokens have their positive logits divided by 1.2 (and negative logits multiplied by it), so their scores always drop. Too high a penalty (2.0+) makes the model avoid legitimate repetition such as pronouns.
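A minimal sketch of this penalty (CTRL-style: divide positive logits, multiply negative ones, so the adjustment always lowers a seen token's score); the function name and list-based logits are illustrative, not a specific library's API:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens already present in the output.

    Positive logits are divided by the penalty and negative logits
    multiplied by it, so the token always becomes less likely.
    """
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted

# A token at logit 2.0 that was already generated drops to 2.0 / 1.2 ≈ 1.67;
# a token at logit -1.0 drops to -1.2. Unseen tokens are untouched.
```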

Length Degeneration

Beam search favors shorter sequences because probability accumulates multiplicatively. A 10-token sequence with 0.9 per-token probability scores higher (0.9^10 ≈ 0.35) than a 20-token sequence with the same per-token probability (0.9^20 ≈ 0.12).

Fix: Length normalization. Divide the final score (a sum of log-probabilities) by the sequence length, or by the length raised to a power alpha. An alpha of 0.6-0.8 balances the brevity preference against completion. Without this, the model outputs terse, incomplete responses.
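The scoring math can be sketched as follows, assuming per-token probabilities are available and the beam score is the sum of their logs (the function name is illustrative):

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """Length-normalized beam score: sum of log-probs divided by len^alpha.

    alpha=0 recovers the raw (length-biased) score; alpha=1 averages
    per token; 0.6-0.8 sits between the two.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return log_prob / (len(token_probs) ** alpha)

# With 0.9 per-token probability, the raw-score gap between a 10-token
# and a 20-token sequence shrinks considerably once alpha=0.7 is applied.
```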

Sampling Collapse

High temperature plus high top_p occasionally samples an extremely low probability token. Once one bad token enters the sequence, the model has no good continuations, and output quality collapses into nonsense.

⚠️ Prevention: Use both temperature and top_p together. Temperature 0.7 reshapes the distribution, then top_p 0.9 filters out the long tail. Never use temperature 2.0 with top_p 1.0 in production.
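Combining the two controls can be sketched like this, using only the standard library (softmax written out by hand; the function name is illustrative):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9):
    """Temperature reshapes the distribution, then nucleus (top-p)
    sampling keeps only the smallest set of tokens whose cumulative
    probability reaches top_p, cutting off the long tail."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda x: -x[1])
    nucleus, cum = [], 0.0
    for i, p in probs:                   # keep tokens until cum prob >= top_p
        nucleus.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in nucleus)       # renormalize within the nucleus
    r = random.random() * z
    for i, p in nucleus:
        r -= p
        if r <= 0:
            return i
    return nucleus[-1][0]
```

With a sharply peaked distribution the nucleus collapses to the top token, so the extremely low-probability tokens that cause collapse can never be drawn.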

Safety Controls

Output filtering: Run generated text through a classifier before returning. Block responses containing harmful content, personally identifiable information, or policy violations. Latency cost: 10-50ms per response.
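The control flow is simple to sketch. Everything here is hypothetical scaffolding: `is_harmful` stands in for a real trained classifier or PII detector, and the substring check is only a placeholder for it:

```python
BLOCKED_MESSAGE = "Response withheld by content filter."

def is_harmful(text: str) -> bool:
    # Placeholder for a real moderation classifier / PII detector.
    banned = ("ssn:", "credit card")
    return any(b in text.lower() for b in banned)

def filtered_generate(generate_fn, prompt: str) -> str:
    """Run the classifier on the generated text before anything
    reaches the user; this check is where the 10-50ms latency goes."""
    response = generate_fn(prompt)
    if is_harmful(response):
        return BLOCKED_MESSAGE
    return response
```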

Logit bias: Increase or decrease the probability of specific tokens during generation by adding a per-token offset to the logits. Setting a token's logit to negative infinity makes it impossible to generate. Used to block profanity, brand names, or competitor mentions.
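The mechanism is just an addition to the logits before softmax; a minimal sketch (the function name and dict-based bias map are illustrative):

```python
def apply_logit_bias(logits, bias_map):
    """Add a per-token bias to the logits. A bias of -inf makes the
    token impossible: its softmax probability becomes exactly 0."""
    return [l + bias_map.get(i, 0.0) for i, l in enumerate(logits)]

# Banning token 1 entirely, while leaving the others untouched:
# apply_logit_bias([1.0, 2.0, 0.5], {1: float("-inf")})
```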

💡 Key Takeaways
Repetition loops occur when prior tokens boost their own probability; fix with repetition penalty 1.2
Beam search favors shorter sequences due to multiplicative probability; fix with length normalization
Sampling collapse: one bad token derails entire output; use temperature + top_p together
Output filtering adds 10-50ms latency but catches harmful content before returning
Logit bias blocks specific tokens by setting their logits to negative infinity
📌 Interview Tips
1. Explain repetition penalty: divide the (positive) logits of already-seen tokens by 1.2, and keep the penalty below 2.0
2. Show the length normalization math: divide the score by length^alpha, where alpha is 0.6-0.8
3. Never use temperature 2.0 with top_p 1.0 in production, due to sampling collapse risk