
Training Recipe: Loss Functions, Temperature, and Data Pipelines

Core Concept
Temperature controls how much the student learns from teacher uncertainty. Higher values (3-20) soften probability distributions, revealing which wrong answers the teacher considered plausible.

The Distillation Loss

Training combines two signals. Hard loss: cross-entropy with ground truth labels. Soft loss: KL divergence between student and teacher distributions at temperature T. The formula: L = α × hard_loss + (1-α) × T² × soft_loss. The T² compensates for gradient shrinkage at high temperatures. Typical settings: α=0.1-0.5, T=3-10.
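The combined loss above can be sketched in a few lines of PyTorch. This is a minimal illustration of the formula, not a reference implementation; the function name and default values for α and T are assumptions chosen from the typical ranges quoted above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.3, T=4.0):
    # Hard loss: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    # T^2 rescales the soft term, compensating for gradients that shrink
    # roughly as 1/T^2 at high temperature.
    return alpha * hard + (1 - alpha) * T**2 * soft
```

Note that when the student exactly matches the teacher, the soft term vanishes and only the hard cross-entropy remains.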

Why Temperature Matters

At T=1, a confident teacher outputs [0.95, 0.03, 0.02]. The student learns only "pick class 1." At T=5, the same logits soften toward something like [0.45, 0.30, 0.25]. Now the student also learns that classes 2 and 3 are plausible alternatives, and how plausible each is relative to the other. This similarity structure transfers semantic knowledge. A dog classifier at high temperature reveals "husky" and "malamute" are more similar than either is to "poodle."
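The softening effect is easy to verify directly. The logits below are invented to roughly reproduce the confident teacher from the example (real teacher logits would differ); the point is only how dividing by T flattens the distribution.

```python
import torch
import torch.nn.functional as F

# Invented logits for a teacher that is ~95% confident in class 1.
logits = torch.tensor([3.0, -0.45, -0.85])

for T in (1.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    # At T=1 the winner dominates; at T=5 the runners-up become visible.
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

The exact softened values depend on the logits, but the pattern is general: raising T lowers the top probability and raises every other one, exposing the teacher's ranking of wrong answers.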

Transfer Sets

Distillation often works better with transfer sets: unlabeled data that the teacher labels with soft predictions. Benefits: larger datasets, reduced overfitting (no exact training duplicates), targeted examples for difficult cases. Downside: generating teacher predictions on millions of examples adds cost. Common pattern: cache teacher outputs to disk, train student from cached soft labels.
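The cache-then-train pattern might look like the sketch below. Everything here is an assumption for illustration: the function names, the single-tensor transfer set, and the use of `torch.save` as the disk cache; a production pipeline would stream batches from a dataloader and shard the cache.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_outputs(teacher, transfer_inputs, path, T=4.0, batch_size=256):
    # One-time pass: run the teacher over the transfer set and save
    # temperature-softened soft labels to disk.
    teacher.eval()
    soft_targets = []
    for i in range(0, len(transfer_inputs), batch_size):
        logits = teacher(transfer_inputs[i:i + batch_size])
        soft_targets.append(F.softmax(logits / T, dim=-1))
    torch.save(torch.cat(soft_targets), path)

def soft_label_loss(student_logits, cached_soft, T=4.0):
    # Training step: KL divergence against the cached teacher distribution,
    # so the teacher never runs again during student training.
    return T**2 * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        cached_soft,
        reduction="batchmean",
    )
```

Because the teacher's forward passes are amortized into one offline job, the student can train for many epochs over millions of cached soft labels without paying teacher inference cost each epoch.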

⚠️ Tuning: Start T=4. Increase if student plateaus early. Decrease if accuracy drops.
💡 Key Takeaways
Temperature (3-20) softens outputs to reveal class similarity structure, not just final predictions
Loss combines hard labels (α) and soft knowledge (1-α) with T² scaling for gradient compensation
High temperature transfers semantic relationships: similar classes get similar probabilities
Transfer sets (unlabeled data + teacher predictions) often outperform original training data
Cache teacher predictions to disk for large-scale distillation to avoid redundant inference
📌 Interview Tips
1. Explain T² scaling in the loss function - it compensates for gradient shrinkage and shows mathematical depth
2. Mention transfer sets as an alternative to training data - demonstrates production experience with distillation
3. Describe temperature tuning: start T=4, adjust based on student learning dynamics