
Temperature and Nucleus Sampling (Top P)

Core Concept
Sampling introduces randomness into decoding. Instead of always picking the highest-probability token, you sample from the model's probability distribution over the vocabulary. Temperature and nucleus sampling control how much randomness is introduced.

Temperature Scaling

Temperature divides the raw model scores (logits) before converting to probabilities. If token A has logit 2.0 and token B has logit 1.0, the probability ratio depends on temperature. At temperature 1.0, standard probabilities apply. Lower temperature sharpens the distribution; higher temperature flattens it.

Temperature 0.5: Divide logits by 0.5 (multiply by 2). The gap between A and B doubles. A becomes more dominant. Output becomes more deterministic and predictable.

Temperature 2.0: Divide logits by 2. The gap shrinks. Lower probability tokens get more chance. Output becomes more random and creative, but also more likely to produce nonsense.
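The scaling above can be sketched in a few lines of NumPy (the function name is illustrative, not a library API):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then apply softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Token A has logit 2.0, token B has logit 1.0 (the example from the text).
logits = [2.0, 1.0]
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Running this shows the effect directly: at temperature 0.5 token A's probability rises toward certainty, at 1.0 it is about 0.73, and at 2.0 the two tokens move closer to even odds.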

Nucleus Sampling (Top P)

Instead of a fixed number of candidates, nucleus sampling keeps tokens until their cumulative probability reaches threshold P. If P=0.9, keep adding tokens (highest first) until they sum to 90% probability, then sample only from that set.

Why this works: The number of reasonable next tokens varies by context. After "The capital of France is" only 1-2 tokens make sense. After "I feel" dozens might work. Top P adapts automatically: tight distributions yield few candidates, flat distributions yield many.
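A minimal sketch of this filtering step, assuming the model's probabilities are already computed (the helper name is illustrative):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalise over that nucleus."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Tight distribution ("The capital of France is ..."): 2 tokens survive at p=0.9.
print(top_p_filter([0.85, 0.10, 0.03, 0.02], p=0.9))
# Flat distribution ("I feel ..."): all 4 tokens survive at p=0.9.
print(top_p_filter([0.30, 0.30, 0.20, 0.20], p=0.9))
```

The two example distributions show the adaptive behaviour: the same p keeps 2 candidates in the peaked case and all 4 in the flat one.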

Practical Settings

💡 Common Configurations:
Factual QA: temp 0.1-0.3, top_p 0.9
Creative writing: temp 0.7-1.0, top_p 0.95
Code generation: temp 0.2, top_p 0.95
Chatbots: temp 0.7, top_p 0.9

Using both together: temperature reshapes probabilities first, then top_p filters. Temperature 0.7 with top_p 0.9 gives controlled creativity: slightly sharper distribution, filtered to reasonable options.
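Putting both steps in the stated order, a hedged end-to-end sketch (the function name and seeded generator are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def sample_token(logits, temperature=0.7, top_p=0.9):
    """Reshape with temperature first, filter with top_p second, then sample."""
    # Step 1: temperature reshapes the distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Step 2: nucleus filter keeps the top tokens summing to top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    nucleus = probs[keep] / probs[keep].sum()     # renormalise over the nucleus
    # Step 3: sample one token id from the nucleus.
    return int(rng.choice(keep, p=nucleus))

token_id = sample_token([5.0, 1.0, 0.5], temperature=0.7, top_p=0.9)
```

Note the ordering matters: because temperature runs first, a low temperature can shrink the nucleus to a single token, making top_p effectively a no-op.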

💡 Key Takeaways
Temperature divides logits before probability conversion: lower = sharper, higher = flatter distribution
Temperature 0.5 makes output more deterministic; temperature 2.0 makes it more random and creative
Nucleus sampling (top_p) keeps tokens until cumulative probability reaches threshold P
Top_p adapts automatically: tight contexts yield few candidates, open contexts yield many
Common settings: factual QA uses temp 0.1-0.3, creative writing uses temp 0.7-1.0
📌 Interview Tips
1. Explain temperature mechanics: lower temp sharpens the distribution, higher flattens it
2. Show why top_p adapts: 'capital of France is' needs few candidates, 'I feel' needs many
3. Give concrete settings: code generation uses temp 0.2, creative writing uses temp 0.7-1.0