Training Recipe: Loss Functions, Temperature, and Data Pipelines
The Distillation Loss
Training combines two signals. The hard loss is cross-entropy against the ground-truth labels. The soft loss is the KL divergence between the student and teacher distributions, both softened at temperature T. Combined: L = α × hard_loss + (1-α) × T² × soft_loss. The T² factor compensates for gradient shrinkage: the soft-loss gradients scale by roughly 1/T², so multiplying by T² keeps the two terms on comparable scales as T changes. Typical settings: α=0.1-0.5, T=3-10.
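The combined loss can be sketched in a few lines of pure Python. This is a minimal illustration, not a production implementation; the function name `distillation_loss` and the default hyperparameters are placeholders, and the soft loss is written as KL(teacher ∥ student).

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      alpha=0.3, T=5.0):
    """L = alpha * hard_loss + (1 - alpha) * T^2 * soft_loss.

    hard_loss: cross-entropy against the one-hot ground-truth label,
               computed at T=1 as usual.
    soft_loss: KL(teacher || student), both distributions at temperature T.
    """
    # Hard loss: negative log-probability of the true class at T=1.
    student_probs = softmax(student_logits, T=1.0)
    hard = -math.log(student_probs[true_label])

    # Soft loss: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, T=T)
    p_student = softmax(student_logits, T=T)
    soft = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))

    # T^2 rescales the soft term so its gradient magnitude stays
    # comparable to the hard term as T grows.
    return alpha * hard + (1 - alpha) * T**2 * soft
```

Note that when student and teacher logits agree, the KL term vanishes and only the hard loss remains, which is a quick sanity check for an implementation.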
Why Temperature Matters
At T=1, a confident teacher outputs [0.95, 0.03, 0.02]. The student learns only "pick class 1." At T=5, dividing the same logits by the temperature before the softmax gives roughly [0.51, 0.26, 0.24]. Now the student learns that classes 2 and 3 relate to each other. This similarity structure transfers semantic knowledge. A dog classifier at high temperature reveals "husky" and "malamute" are more similar than either is to "poodle."
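The numbers above are easy to reproduce. A short sketch: recover logits from the T=1 distribution (any logits differing only by an additive constant produce the same softmax at every temperature, since the constant cancels), then re-apply the softmax at T=5.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits consistent with the T=1 distribution [0.95, 0.03, 0.02].
logits = [math.log(p) for p in [0.95, 0.03, 0.02]]

sharp = softmax(logits, T=1.0)  # recovers [0.95, 0.03, 0.02]
soft = softmax(logits, T=5.0)   # roughly [0.51, 0.26, 0.24]
```

The high-temperature distribution still ranks class 1 first, but the gap between classes 2 and 3 is now visible to the student.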
Transfer Sets
Distillation often works better with transfer sets: unlabeled data that the teacher labels with soft predictions. Benefits: larger datasets, reduced overfitting (the student never trains on the exact examples the teacher may have memorized), and targeted examples for difficult cases. Downside: generating teacher predictions on millions of examples adds compute cost. Common pattern: cache teacher outputs to disk once, then train the student from the cached soft labels.
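The caching pattern can be sketched as below. The `teacher_logits` function is a hypothetical stand-in for a real teacher forward pass, and the JSON cache file name is illustrative; a real pipeline would typically use a binary format for millions of examples.

```python
import json
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_logits(example):
    """Hypothetical stand-in for an expensive teacher forward pass."""
    return [float(len(example)), 1.0, 0.5]

CACHE_PATH = "soft_labels.json"  # illustrative file name

def build_cache(examples, T=5.0):
    """One expensive pass over the transfer set; soft labels go to disk."""
    cache = {ex: softmax(teacher_logits(ex), T=T) for ex in examples}
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)

def load_cache():
    """Student training reads the cached soft labels; no teacher needed."""
    with open(CACHE_PATH) as f:
        return json.load(f)

examples = ["a", "bb", "ccc"]
build_cache(examples)
soft_labels = load_cache()
```

After the cache is built, student training epochs never touch the teacher, so the teacher's cost is paid once regardless of how many epochs or student variants follow.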