ML Model Optimization • Knowledge Distillation • Hard • ⏱️ ~3 min
Training Recipe: Loss Functions, Temperature, and Data Pipelines
The distillation training recipe combines ground-truth supervision with teacher knowledge through a weighted composite loss. The standard formulation is L = α·CE(student, hard labels) + β·KL(student soft outputs ‖ teacher soft outputs), with α + β typically summing to 1. The KL term is scaled by T² (temperature squared) to keep gradient magnitudes comparable, following the original Hinton formulation. Typical hyperparameter grids sweep T ∈ {2, 4, 8, 12} and (α, β) pairs from 0.1 to 0.9, with common production values around T = 5, α = 0.3, β = 0.7.
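A minimal PyTorch sketch of this composite loss, assuming logit-level access to both teacher and student; the function name and the defaults (α = 0.3, β = 0.7, T = 5) are illustrative choices, not a fixed API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.3, beta=0.7, temperature=5.0):
    # Hard-label term: standard cross entropy against ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-target term: KL divergence between temperature-softened distributions.
    # Multiplied by T^2 so its gradient magnitude stays comparable to the CE term
    # (the scaling from the original Hinton formulation).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kld = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    return alpha * ce + beta * kld
```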
Temperature choice is critical. Too low (T = 1) makes the soft targets nearly one-hot and removes the dark-knowledge benefit; too high (T = 50) produces nearly uniform distributions that carry weak signal. The sweet spot depends on the task and teacher confidence: for well-calibrated teachers on focused tasks, T = 3 to 5 works well, while for overconfident teachers or broad multi-label problems, T = 10 to 20 can help. Monitor the validation loss components separately: if the distillation loss plateaus early while cross entropy keeps improving, the teacher signal may be too weak or the temperature too high.
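A quick way to see the effect: soften a teacher logit vector at several temperatures. The logits below are made up; the qualitative behavior (near one-hot at T = 1, near uniform at T = 50) is the point.

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 2.0, 1.0, 0.5])  # hypothetical 4-class teacher logits
for T in (1, 5, 50):
    probs = F.softmax(teacher_logits / T, dim=-1)
    print(f"T={T:>2}:", [round(p, 3) for p in probs.tolist()])
# T= 1 -> close to one-hot: little dark knowledge survives
# T= 5 -> secondary classes visible: useful relative-similarity signal
# T=50 -> nearly uniform: weak training signal
```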
Data pipeline complexity scales with the distillation approach. For response-based distillation over 100 million examples with 1,000 classes, storing full float16 probability vectors takes 200 GB. Instead, store the top-50 probabilities with their indices and a normalization constant, reducing storage to under 10 GB. Run teacher inference in batch on GPU clusters, generating soft targets at 10,000 to 50,000 examples per GPU-hour depending on model size. For feature-based distillation, also serialize intermediate activations, which adds storage but typically improves student accuracy by 2 to 5 percent. For relation-based methods, compute pairwise similarities on batches during student training rather than precomputing them, since the O(n²) storage cost is prohibitive.
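A sketch of that top-k storage scheme: keep only the top 50 probabilities, their class indices, and the leftover mass, then spread the residual uniformly over the remaining classes at reconstruction time. The uniform-residual approximation and the helper names are assumptions, not part of any standard library.

```python
import torch
import torch.nn.functional as F

def compress_soft_targets(teacher_logits, k=50, temperature=5.0):
    probs = F.softmax(teacher_logits / temperature, dim=-1)     # [batch, num_classes]
    top_p, top_idx = probs.topk(k, dim=-1)                      # [batch, k]
    residual = 1.0 - top_p.sum(dim=-1, keepdim=True)            # mass outside the top-k
    # Store compact dtypes: fp16 probabilities, int16 indices (enough for 1,000 classes).
    return top_p.half(), top_idx.to(torch.int16), residual.half()

def reconstruct_soft_targets(top_p, top_idx, residual, num_classes=1000):
    batch, k = top_p.shape
    # Approximation: spread the residual mass uniformly over the non-top-k classes.
    targets = (residual.float() / (num_classes - k)).expand(batch, num_classes).clone()
    targets.scatter_(1, top_idx.long(), top_p.float())          # restore the stored top-k
    return targets                                               # rows sum back to 1
```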
Stability techniques matter for convergence. Use label smoothing on the hard targets to prevent overconfidence, typically with a smoothing factor of 0.1 to 0.2. Apply learning-rate warmup for the first 5 to 10 percent of steps to avoid early instability from mismatched teacher and student outputs. For feature distillation, add small projection networks, trained jointly with the student, to match dimensions when the teacher and student hidden sizes differ. If you have only black-box API access, use active learning to focus expensive teacher queries on regions where the student is uncertain, reducing teacher calls by 50 to 70 percent while maintaining quality.
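One way to implement that active-learning filter is to rank unlabeled examples by the student's predictive entropy and send only the most uncertain fraction to the paid teacher API. This is a sketch under the assumptions that the student is a PyTorch module and the loader yields plain input tensors; all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_uncertain_examples(student, unlabeled_loader, query_fraction=0.3, device="cuda"):
    student.eval()
    entropies, examples = [], []
    for batch in unlabeled_loader:                 # assumed: each batch is an input tensor
        logits = student(batch.to(device))
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-example entropy
        entropies.append(entropy.cpu())
        examples.append(batch)
    entropies = torch.cat(entropies)
    pool = torch.cat(examples)
    k = int(query_fraction * len(entropies))
    top_idx = entropies.topk(k).indices            # most uncertain examples
    return pool[top_idx]                           # query the teacher only on these
```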
💡 Key Takeaways
• Composite loss with α = 0.3 and β = 0.7 balances hard-label fitting and teacher mimicking; temperature T = 5 is a typical sweet spot for most tasks
• T² scaling on the KL term keeps gradient magnitudes stable and prevents the teacher signal from dominating; it is derived from matching gradient norms in the original Hinton paper
• Data efficiency: storing the top-50 probabilities instead of full 1,000-class vectors reduces storage from 200 GB to under 10 GB for a 100-million-example corpus
• Teacher inference runs at 10,000 to 50,000 examples per GPU-hour on batch clusters; feature distillation adds 2 to 5 percent accuracy but increases storage for intermediate activations
• Active learning for black-box distillation reduces teacher API calls by 50 to 70 percent by focusing queries where student uncertainty is highest
📌 Examples
Training DistilBERT: α = 0.3, β = 0.7, T = 4, learning rate 5e-5 with linear warmup over 10,000 steps; converges in 3 days on 8 GPUs over 100 million sequences
Feature-matching setup: add a 2-layer MLP projection from the student's 384-dimension hidden states to the teacher's 768 dimensions, and use a cosine-similarity loss with weight 0.1 combined with response distillation (see the sketch after this list)
Black-box LLM distillation: sample 1 million diverse prompts, query the teacher API at $0.002 per call ($2,000 total), train the student on prompt-response pairs plus actively sampled hard examples, and reach 93 percent of teacher quality
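A sketch of the feature-matching example above: a 2-layer MLP projects the student's 384-dim hidden states into the teacher's 768-dim space, and a cosine-similarity loss with weight 0.1 is added to the response-distillation loss. Only the dimensions and the loss weight come from the example; the module structure and GELU activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """Projects student hidden states into the teacher's hidden dimension."""
    def __init__(self, student_dim=384, teacher_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, student_hidden):
        return self.proj(student_hidden)

def feature_loss(projected_student, teacher_hidden, weight=0.1):
    # 1 - cosine similarity, averaged over tokens/examples, with the auxiliary weight applied.
    cos = F.cosine_similarity(projected_student, teacher_hidden, dim=-1)
    return weight * (1.0 - cos).mean()

# Trained jointly with the student, e.g.:
# total_loss = distillation_loss(...) + feature_loss(projector(student_hidden), teacher_hidden)
```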