Natural Language Processing Systems • Text Generation (Beam Search, Sampling, Decoding)
Temperature and Nucleus Sampling (Top P)
Sampling-based decoding injects controlled randomness to increase diversity. Temperature rescales the logits before the softmax: with temperature T, divide each logit by T, then apply softmax. Temperature below 1.0 sharpens the distribution, making high-probability tokens even more likely and producing conservative outputs. Temperature above 1.0 flattens the distribution, spreading probability mass to less likely tokens and increasing creativity but also the risk of incoherence. Temperature of exactly 1.0 leaves the distribution unchanged.
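To make the rescaling concrete, here is a minimal PyTorch sketch that applies temperature to a toy logit vector; the logit values are made up purely for illustration.

```python
import torch

def temperature_softmax(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Divide each logit by T, then apply softmax over the vocabulary."""
    return torch.softmax(logits / temperature, dim=-1)

# Toy logits for a 4-token vocabulary (illustrative values, not from a real model).
logits = torch.tensor([4.0, 3.0, 1.0, 0.5])

print(temperature_softmax(logits, 0.5))  # sharper: the top token dominates
print(temperature_softmax(logits, 1.0))  # plain softmax, distribution unchanged
print(temperature_softmax(logits, 2.0))  # flatter: probability spreads to the tail
```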
Top p sampling, also called nucleus sampling, dynamically selects the smallest set of tokens whose cumulative probability reaches a threshold p, then samples within that set after renormalizing. Unlike top k, which always keeps exactly k candidates, top p adapts the candidate set size to the shape of the distribution: when the model is confident, the nucleus might contain only 5 tokens; when it is uncertain, it might include 200. This adaptive behavior makes top p widely preferred for open-ended generation.
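A sketch of one nucleus-sampling step under the same toy setup (the function name is mine, not from any particular library):

```python
import torch

def top_p_sample(probs: torch.Tensor, top_p: float) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches top_p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token needed to cross the threshold, including the one that crosses it.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize inside the nucleus
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_idx[choice].item())

# Confident distribution: with top_p=0.9 the nucleus holds just two tokens.
probs = torch.tensor([0.70, 0.25, 0.03, 0.02])
print(top_p_sample(probs, top_p=0.9))
```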
Production systems combine temperature with top p. OpenAI defaults to temperature 1.0 and allows user override. Anthropic's Claude uses temperature around 1.0 with top p near 0.95 for chat. Typical ranges are temperature 0.7 to 1.0 and top p 0.9 to 0.97. Going below temperature 0.5 or top p 0.85 makes outputs very repetitive and generic. Going above temperature 1.5 or using top p 0.99 often produces incoherent tangents. Very high temperature with very small top p can create a trap: you flatten probabilities over a tiny candidate set, causing random oscillation between just a few tokens.
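These knobs are exposed directly by hosted APIs. As an illustration, a request through the OpenAI Python SDK can set both parameters in one call; the model name below is a placeholder and the prompt is arbitrary, so treat this as a sketch of the parameter plumbing rather than a recommended configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize nucleus sampling in one sentence."}],
    temperature=0.8,      # inside the typical 0.7-1.0 range
    top_p=0.95,           # inside the typical 0.9-0.97 range
)
print(response.choices[0].message.content)
```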
Sampling keeps one hypothesis per request, so memory scales linearly with concurrent users, not beam width. For a 7B model generating 200 tokens, one sampled sequence uses about 100 MB of KV cache. This lets continuous batching schedulers pack roughly 60 users per GPU instead of 15 with beam width 4. The tradeoff is that sampling gives up the high-likelihood paths that beam search searches for, but user studies consistently show that people prefer diverse sampled outputs over safe beam search results in chat and creative tasks.
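The memory numbers above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a Llama-2-7B-like shape (32 layers, 4096 hidden size, fp16 KV entries) and a roughly 6 GB KV cache budget per GPU; the shape and budget are assumptions for illustration, only the 100 MB and 60-versus-15 figures come from the text.

```python
# KV cache sizing sketch. Assumed model shape: 32 layers, 4096 hidden size,
# fp16 (2 bytes per element); the ~6 GB KV budget per GPU is also an assumption.
n_layers, hidden_size, bytes_per_elem = 32, 4096, 2
kv_bytes_per_token = 2 * n_layers * hidden_size * bytes_per_elem  # keys + values ~= 0.5 MB

seq_len = 200
mb_per_sequence = kv_bytes_per_token * seq_len / 2**20  # ~= 100 MB per sampled sequence

kv_budget_mb = 6 * 1024
print(f"{mb_per_sequence:.0f} MB per sequence")
print("sampled sequences per GPU:", int(kv_budget_mb // mb_per_sequence))        # ~60
print("beam-4 requests per GPU:  ", int(kv_budget_mb // (4 * mb_per_sequence)))  # 15
```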
💡 Key Takeaways
•Temperature divides logits before softmax: values below 1.0 sharpen the distribution, above 1.0 flatten it. Temperature 0.5 doubles every logit gap, squaring pairwise probability ratios, while 2.0 halves the gaps and spreads probability more evenly
•Top p (nucleus sampling) adapts candidate set size to distribution confidence: might select 5 tokens when model is certain, 200 when uncertain, unlike fixed top k
•Production defaults are temperature 0.7 to 1.0 and top p 0.9 to 0.97. OpenAI and Anthropic use these ranges for chat, prioritizing diversity over pure likelihood maximization
•Sampling uses 100 MB KV cache per 200 token sequence on 7B models, allowing 60 concurrent users per GPU versus 15 with beam width 4, improving throughput 4x
•Very high temperature with very small top p creates a failure mode: flattening probabilities over a tiny set causes random oscillation between few tokens with no coherent pattern
📌 Examples
Claude chat with temperature 1.0 and top p 0.95: At one step, top 50 tokens sum to 0.95 probability, model samples "explained" instead of always picking "said", creating natural variety
Coding assistant with temperature 0.3 and top p 0.9: Sharper distribution favors common patterns like "return" and "if", reducing syntax errors but still allowing occasional creative solutions
Creative writing with temperature 1.5 and top p 0.98: Story generation includes unexpected words like "shimmering" instead of "bright", but some sentences become incoherent tangents about unrelated topics