Decoding Failure Modes and Safety Controls
Repetition loops are a common failure mode of deterministic decoding. Greedy decoding and wide-beam search can get stuck generating the same phrase repeatedly because each local step maximizes probability without global planning: the model might produce "the the the the" or cycle through "I think that I think that" indefinitely. Mitigations include repetition penalties, which reduce the logits of tokens already in the sequence, and no-repeat n-gram constraints, which block any n-gram from appearing twice. These heuristics break exact probability semantics but are essential in practice.
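A minimal sketch of both heuristics applied to a single step's logits; the penalty value and the 1D logits tensor are illustrative, not tied to any particular library's API:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens already in the sequence
    (CTRL-style: divide positive logits, multiply negative ones)."""
    for token_id in set(generated_ids):
        score = logits[token_id]
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits

def block_repeated_ngrams(logits, generated_ids, n=3):
    """Forbid any token that would complete an n-gram already seen."""
    if len(generated_ids) < n - 1:
        return logits
    # Map every previously seen (n-1)-gram to the tokens that followed it.
    seen = {}
    for i in range(len(generated_ids) - n + 1):
        key = tuple(generated_ids[i:i + n - 1])
        seen.setdefault(key, set()).add(generated_ids[i + n - 1])
    # Ban continuations of the current (n-1)-gram prefix.
    prefix = tuple(generated_ids[-(n - 1):])
    for banned in seen.get(prefix, ()):
        logits[banned] = float("-inf")
    return logits
```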
Length bias in beam search causes premature termination or verbose padding. Without length normalization, shorter sequences accumulate less negative log probability and win the beam: a translation might stop at 5 tokens when 30 are needed for adequacy. Applying a length penalty with alpha between 0.6 and 1.0 encourages longer outputs, but over-correcting with alpha above 1.2 can produce padding, where the model generates filler words to maximize the length bonus. Task-specific tuning is necessary.
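One common form of this normalization is the GNMT-style length penalty; a sketch of how beam hypotheses would be rescored with it (alpha is the tunable exponent discussed above):

```python
def length_penalty(length, alpha=0.7):
    """GNMT-style penalty: ((5 + length) / 6) ** alpha.
    Grows with length, so dividing by it rewards longer hypotheses."""
    return ((5 + length) / 6) ** alpha

def normalized_score(sum_log_prob, length, alpha=0.7):
    """Rank beam hypotheses by total log prob divided by the penalty,
    so adequate long hypotheses can overtake short truncated ones."""
    return sum_log_prob / length_penalty(length, alpha)
```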
End-of-sequence (EOS) tokens may never be generated under aggressive sampling or after penalties. If top-p is set very low and the true EOS probability falls outside the nucleus, the model will never terminate naturally. Systems must enforce a maximum token limit, typically 2048 to 4096, and use stop sequences such as role markers or format delimiters. Without these hard stops, requests can run until timeout, wasting GPU cycles and blocking other users. Meta's Llama and OpenAI's GPT models both enforce max tokens server-side to prevent this failure.
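A sketch of a decode loop with both hard stops; `sample_next_token` and `decode` are caller-supplied placeholders, and the stop strings are examples rather than any provider's actual delimiters:

```python
def generate(prompt_ids, sample_next_token, decode, eos_token_id,
             max_new_tokens=2048, stop_sequences=("</s>", "\nUser:")):
    """Generate until EOS, a stop sequence, or the hard max-token cap.
    `sample_next_token(ids)` returns the next token ID;
    `decode(ids)` turns generated IDs back into text."""
    output_ids = list(prompt_ids)
    new_ids = []
    for _ in range(max_new_tokens):          # hard server-side cap
        next_id = sample_next_token(output_ids)
        output_ids.append(next_id)
        new_ids.append(next_id)
        if next_id == eos_token_id:          # natural termination
            break
        text = decode(new_ids)
        if any(text.endswith(s) for s in stop_sequences):
            break                            # format-level stop
    return output_ids
```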
Token-level safety controls must align with the tokenizer. Blocking plain-text words is insufficient because subword tokens can reconstruct banned phrases: blocking "violence" does not prevent the tokens "vio" and "lence" from being sampled sequentially. Providers maintain token-level blocklists and apply logit masking at every decode step. Anthropic and OpenAI also run post-generation classifiers to catch unsafe content that slipped through, then refuse to return it.
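A sketch of step-wise logit masking against a token-ID blocklist, assuming a Hugging Face-style tokenizer with an `encode` method; the phrase variants and helper names are illustrative:

```python
import torch

def build_blocklist(tokenizer, banned_phrases):
    """Expand each banned phrase into the token IDs of several surface
    variants (bare, leading-space, capitalized), since subword tokenizers
    assign different IDs to each form."""
    banned_ids = set()
    for phrase in banned_phrases:
        for variant in (phrase, " " + phrase, phrase.capitalize()):
            banned_ids.update(tokenizer.encode(variant, add_special_tokens=False))
    return banned_ids

def mask_blocked_tokens(logits, banned_ids):
    """Applied at every decode step, before softmax/sampling."""
    logits[list(banned_ids)] = float("-inf")
    return logits
```

Note that expanding a phrase into its subword pieces inevitably over-blocks (a piece like "vio" also appears in harmless words), which is one reason providers pair decode-time masks with post-generation classifiers rather than relying on either alone.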
Constrained generation of structured outputs like JSON or SQL can fail under unconstrained sampling. The model might generate a valid opening brace and then sample tokens that violate the schema, producing unmatched braces or missing required fields. Enforce constraints with a finite-state machine or grammar that masks invalid tokens at each step; libraries like Outlines and Guidance implement this by intersecting the vocabulary with the grammar's allowed next tokens. Post-hoc repair with regexes or parsers is brittle and fails on complex schemas.
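A toy sketch of the masking step, assuming a precomputed table mapping each FSM state to its grammatically valid token IDs (conceptually what such libraries build from a grammar, not their actual API):

```python
import torch

def constrained_step(logits, fsm_state, allowed_tokens_for_state):
    """Mask every token the grammar forbids from the current FSM state,
    then sample only from the remaining set.
    Assumes the grammar always allows at least one token per state."""
    allowed = allowed_tokens_for_state[fsm_state]      # set of token IDs
    mask = torch.full_like(logits, float("-inf"))
    mask[list(allowed)] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).item()
    # The caller then advances the FSM by consuming next_token.
    return next_token
```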
Resource starvation occurs when a few requests monopolize GPU memory. Allowing beam width 10 or max tokens 8192 on a shared GPU can exhaust the KV cache and evict other batches, spiking P95 latency from 2 seconds to 30 seconds. Multi-tenant schedulers must cap beam width, max tokens, and memory per request. Some providers implement tiered rate limits where self-serve users get beam width 1 and max tokens 2048, while enterprise customers can request higher limits with reserved capacity.
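A minimal sketch of per-tier request validation in front of the scheduler; the self-serve numbers mirror the example above, while the enterprise values and tier names are illustrative assumptions, not any provider's actual limits:

```python
from dataclasses import dataclass

@dataclass
class TierLimits:
    max_beam_width: int
    max_new_tokens: int

TIER_LIMITS = {
    "self_serve": TierLimits(max_beam_width=1, max_new_tokens=2048),
    "enterprise": TierLimits(max_beam_width=4, max_new_tokens=8192),  # illustrative
}

def validate_request(tier, beam_width, max_new_tokens):
    """Reject requests that exceed the tier's caps before they ever
    reach the GPU scheduler and its KV-cache budget."""
    limits = TIER_LIMITS[tier]
    if beam_width > limits.max_beam_width or max_new_tokens > limits.max_new_tokens:
        raise ValueError(f"request exceeds {tier} limits")
```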
💡 Key Takeaways
• Repetition loops occur in greedy and beam search when the local argmax at each step creates cycles like "the the the". Repetition penalties reduce logits for already-seen tokens, breaking exact probability semantics but preventing degeneration
• Length bias without normalization favors short sequences: a translation might stop at 5 tokens when 30 are needed. A length-penalty alpha of 0.6 to 1.0 encourages adequacy, but alpha above 1.2 creates verbose padding
• The EOS token may never make it into the top-p nucleus under aggressive sampling. Hard max-token limits (2048 to 4096) and stop sequences are mandatory to prevent runaway generation and GPU waste
• Token-level safety masks must block subword tokens, not just plain-text words. Blocking "violence" requires blocking tokens like "vio" and "lence" separately to prevent reconstruction during sampling
• Constrained generation for JSON or SQL requires a finite-state grammar at decode time to mask invalid tokens. Post-hoc regex repair fails on complex schemas with nested structures
• Beam width 10 or max tokens 8192 can exhaust the KV cache on shared GPUs, evicting other batches and spiking P95 latency from 2 seconds to 30 seconds. Schedulers must cap parameters per tier
📌 Examples
Repetition penalty scenario: greedy decoding starts cycling through "I think that I think that"; a repetition penalty of 1.2 divides the positive logits for "I" and "think" by 1.2 (roughly a 17 percent reduction), lowering their scores enough that the model samples "however" instead
Length bias in translation: without a length penalty, beam search outputs "Le chat" (2 tokens, log prob -1.5) instead of the correct "Le chat est sur le tapis" (6 tokens, log prob -4.2) because the shorter hypothesis's total score is less negative; with simple per-token normalization, -1.5 / 2 = -0.75 versus -4.2 / 6 = -0.70, so the longer hypothesis now wins
EOS failure with top-p 0.85: the true EOS probability is 0.10, but higher-ranked tokens already cover the 0.85 cumulative mass, so EOS is truncated from the nucleus and never sampled; the model generates 4096 tokens until the hard limit, wasting 80 seconds of GPU time
Safety mask bypass: the blocklist masks token 1234 ("kill"), but the model samples tokens 12 ("ki") and 34 ("ll") sequentially (illustrative IDs), reconstructing the banned word without ever triggering the mask