ML Model Optimization • Knowledge Distillation • Hard • ⏱️ ~3 min
Training Recipe: Loss Functions, Temperature, and Data Pipelines
The distillation training recipe combines ground-truth supervision with teacher knowledge through a weighted composite loss. The standard formulation is L = α·CE(student, hard labels) + β·KL(student soft outputs ‖ teacher soft outputs), with α + β typically summing to 1. The KL term is scaled by T² (temperature squared) to keep gradient magnitudes comparable, following the original Hinton formulation. Typical hyperparameter grids sweep T ∈ {2, 4, 8, 12} and (α, β) pairs from 0.1 to 0.9, with common production values around T = 5, α = 0.3, β = 0.7.
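A minimal PyTorch sketch of this composite loss, assuming logit-level access to both teacher and student; the function name and the defaults (α = 0.3, β = 0.7, T = 5) are illustrative choices, not a fixed API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.3, beta=0.7, temperature=5.0):
    # Hard-label term: standard cross entropy against ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-target term: KL divergence between temperature-softened distributions.
    # Multiplied by T^2 so its gradient magnitude stays comparable to the CE term
    # (the scaling from the original Hinton formulation).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kld = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    return alpha * ce + beta * kld
```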
Temperature choice is critical. Too low (T = 1) makes the soft targets nearly one-hot and removes the dark-knowledge benefit; too high (T = 50) produces nearly uniform distributions that carry weak signal. The sweet spot depends on the task and teacher confidence: for well-calibrated teachers on focused tasks, T = 3 to 5 works well, while for overconfident teachers or broad multi-label problems, T = 10 to 20 can help. Monitor the validation loss components separately: if the distillation loss plateaus early while cross entropy keeps improving, the teacher signal may be too weak or the temperature too high.
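A quick way to see the effect: soften a teacher logit vector at several temperatures. The logits below are made up; the qualitative behavior (near one-hot at T = 1, near uniform at T = 50) is the point.

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 2.0, 1.0, 0.5])  # hypothetical 4-class teacher logits
for T in (1, 5, 50):
    probs = F.softmax(teacher_logits / T, dim=-1)
    print(f"T={T:>2}:", [round(p, 3) for p in probs.tolist()])
# T= 1 -> close to one-hot: little dark knowledge survives
# T= 5 -> secondary classes visible: useful relative-similarity signal
# T=50 -> nearly uniform: weak training signal
```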
Data pipeline complexity scales with the distillation approach. For response-based distillation over 100 million examples with 1,000 classes, storing full float16 probability vectors takes 200 GB. Instead, store the top-50 probabilities with their indices and a normalization constant, reducing storage to under 10 GB. Run teacher inference in batch on GPU clusters, generating soft targets at 10,000 to 50,000 examples per GPU-hour depending on model size. For feature-based distillation, also serialize intermediate activations, which adds storage but typically improves student accuracy by 2 to 5 percent. For relation-based methods, compute pairwise similarities on batches during student training rather than precomputing them, since the O(n²) storage cost is prohibitive.
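A sketch of that top-k storage scheme: keep only the top 50 probabilities, their class indices, and the leftover mass, then spread the residual uniformly over the remaining classes at reconstruction time. The uniform-residual approximation and the helper names are assumptions, not part of any standard library.

```python
import torch
import torch.nn.functional as F

def compress_soft_targets(teacher_logits, k=50, temperature=5.0):
    probs = F.softmax(teacher_logits / temperature, dim=-1)     # [batch, num_classes]
    top_p, top_idx = probs.topk(k, dim=-1)                      # [batch, k]
    residual = 1.0 - top_p.sum(dim=-1, keepdim=True)            # mass outside the top-k
    # Store compact dtypes: fp16 probabilities, int16 indices (enough for 1,000 classes).
    return top_p.half(), top_idx.to(torch.int16), residual.half()

def reconstruct_soft_targets(top_p, top_idx, residual, num_classes=1000):
    batch, k = top_p.shape
    # Approximation: spread the residual mass uniformly over the non-top-k classes.
    targets = (residual.float() / (num_classes - k)).expand(batch, num_classes).clone()
    targets.scatter_(1, top_idx.long(), top_p.float())          # restore the stored top-k
    return targets                                               # rows sum back to 1
```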
Stability techniques matter for convergence. Use label smoothing on the hard targets to prevent overconfidence, typically with a smoothing factor of 0.1 to 0.2. Apply learning-rate warmup for the first 5 to 10 percent of steps to avoid early instability from mismatched teacher and student outputs. For feature distillation, add small projection networks, trained jointly with the student, to match dimensions when the teacher and student hidden sizes differ. If you have only black-box API access, use active learning to focus expensive teacher queries on regions where the student is uncertain, reducing teacher calls by 50 to 70 percent while maintaining quality.
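One way to implement that active-learning filter is to rank unlabeled examples by the student's predictive entropy and send only the most uncertain fraction to the paid teacher API. This is a sketch under the assumptions that the student is a PyTorch module and the loader yields plain input tensors; all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_uncertain_examples(student, unlabeled_loader, query_fraction=0.3, device="cuda"):
    student.eval()
    entropies, examples = [], []
    for batch in unlabeled_loader:                 # assumed: each batch is an input tensor
        logits = student(batch.to(device))
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-example entropy
        entropies.append(entropy.cpu())
        examples.append(batch)
    entropies = torch.cat(entropies)
    pool = torch.cat(examples)
    k = int(query_fraction * len(entropies))
    top_idx = entropies.topk(k).indices            # most uncertain examples
    return pool[top_idx]                           # query the teacher only on these
```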
💡 Key Takeaways
• Composite loss with α = 0.3 and β = 0.7 balances hard-label fitting and teacher mimicking; temperature T = 5 is a typical sweet spot for most tasks
• T² scaling on the KL term keeps gradient magnitudes stable and prevents the teacher signal from dominating; it is derived from matching gradient norms in the original Hinton paper
• Data efficiency: storing the top-50 probabilities instead of full 1,000-class vectors reduces storage from 200 GB to under 10 GB for a 100-million-example corpus
• Teacher inference runs at 10,000 to 50,000 examples per GPU-hour on batch clusters; feature distillation adds 2 to 5 percent accuracy but increases storage for intermediate activations
• Active learning for black-box distillation reduces teacher API calls by 50 to 70 percent by focusing queries where student uncertainty is highest
📌 Examples
Training DistilBERT: α = 0.3, β = 0.7, T = 4, learning rate 5e-5 with linear warmup over 10,000 steps; converges in 3 days on 8 GPUs over 100 million sequences
Feature-matching setup: add a 2-layer MLP projection from the student's 384-dimension hidden states to the teacher's 768 dimensions, and use a cosine-similarity loss with weight 0.1 combined with response distillation (see the sketch after this list)
Black-box LLM distillation: sample 1 million diverse prompts, query the teacher API at $0.002 per call ($2,000 total), train the student on prompt-response pairs plus actively sampled hard examples, and reach 93 percent of teacher quality
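A sketch of the feature-matching example above: a 2-layer MLP projects the student's 384-dim hidden states into the teacher's 768-dim space, and a cosine-similarity loss with weight 0.1 is added to the response-distillation loss. Only the dimensions and the loss weight come from the example; the module structure and GELU activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """Projects student hidden states into the teacher's hidden dimension."""
    def __init__(self, student_dim=384, teacher_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, student_hidden):
        return self.proj(student_hidden)

def feature_loss(projected_student, teacher_hidden, weight=0.1):
    # 1 - cosine similarity, averaged over tokens/examples, with the auxiliary weight applied.
    cos = F.cosine_similarity(projected_student, teacher_hidden, dim=-1)
    return weight * (1.0 - cos).mean()

# Trained jointly with the student, e.g.:
# total_loss = distillation_loss(...) + feature_loss(projector(student_hidden), teacher_hidden)
```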