
Three Transfer Granularities: Response, Feature, and Relation Based Distillation

RESPONSE BASED DISTILLATION

The most common approach: train the student to match the teacher's final-layer output. For each training input, run both teacher and student, then minimize the distance between their output distributions. The loss typically combines cross entropy with the ground-truth labels and KL divergence (a measure of how different two probability distributions are) with the teacher's outputs: loss = α × hard_loss + (1−α) × soft_loss. Typical α is 0.5 to 0.9.
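The combined loss can be sketched in plain NumPy. This is a minimal illustration, not a training loop; real systems would compute it with a framework like PyTorch, and the α and T defaults below are just the typical values mentioned in this card:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, true_label, alpha=0.7, T=4.0):
    # Hard loss: cross entropy against the ground-truth label.
    hard = -np.log(softmax(student_logits)[true_label])
    # Soft loss: KL divergence between temperature-softened distributions.
    # (In practice this term is often scaled by T**2 to balance gradients.)
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Weighted combination: alpha on the hard loss, (1 - alpha) on the soft loss.
    return alpha * hard + (1 - alpha) * soft
```

When the student exactly matches the teacher, the soft term goes to zero and only the hard cross-entropy term remains.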

TEMPERATURE SCALING

Teacher outputs are often overconfident: 99.9% for the correct class, near zero for everything else. This hides information about class relationships. Temperature scaling softens the distribution: divide the logits (the pre-softmax values that softmax converts into probabilities) by a temperature T before applying softmax. T=1 is the normal softmax; T=5 spreads probability more evenly. Higher temperature reveals more of the teacher's knowledge but may also transfer noise. T=3 to T=5 works well for most tasks.
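A quick way to see the effect, using made-up 4-class logits (the specific numbers are illustrative only):

```python
import numpy as np

def softened(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0, 0.0])   # hypothetical teacher logits
print(softened(logits, T=1).round(3))     # sharp: ~99.6% on the top class
print(softened(logits, T=5).round(3))     # softer: runner-up classes become visible
```

The ranking of classes is unchanged; only the relative probabilities flatten, which is what exposes the class-relationship structure to the student.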

FEATURE BASED DISTILLATION

Instead of matching only final outputs, match intermediate representations: force the student's hidden layers to resemble the teacher's hidden layers. This transfers the teacher's internal structure, not just its predictions. It works best when teacher and student have similar architectures, and requires a projection layer if their dimensions differ.

⚠️ Trade-off: Feature distillation is more complex to implement and tune but often achieves better results than response-only distillation, especially for smaller students.
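The projection step can be sketched as follows. The hidden sizes (768 for the teacher, 256 for the student) and the random projection matrix are hypothetical; in real training the projection is a learned layer optimized jointly with the student:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: teacher hidden dim 768, student hidden dim 256, batch of 4.
teacher_feat = rng.standard_normal((4, 768))
student_feat = rng.standard_normal((4, 256))

# Projection matrix mapping student features into the teacher's dimension.
# Randomly initialized here just to show the shapes involved.
W_proj = rng.standard_normal((256, 768)) * 0.02

projected = student_feat @ W_proj                         # shape (4, 768)
feature_loss = np.mean((projected - teacher_feat) ** 2)   # MSE, minimized during training
```

The feature loss is typically added to the response-based loss with its own weighting coefficient, which is part of why feature distillation is harder to tune.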

TRAINING DATA

You can distill on the original training data or on unlabeled data with teacher-generated labels. Unlabeled data often improves results because it provides more diverse examples. The teacher effectively labels this extra data for free.
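Generating those pseudo-labels is cheap. A minimal sketch, where the "teacher" is a stand-in function with made-up weights (any trained model mapping inputs to logits would play this role):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in teacher: any trained model mapping inputs to class logits.
def teacher_logits(x):
    return x @ np.array([[2.0, -1.0],
                         [-1.0, 2.0]])      # hypothetical weights for illustration

unlabeled = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.5, 0.5]])          # unlabeled inputs

# Soft pseudo-labels: the full teacher distribution for each example,
# used as the soft_loss target when training the student.
pseudo_labels = softmax(teacher_logits(unlabeled))
```

Keeping the full distribution (rather than just the argmax class) preserves the class-relationship information that makes distillation work.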

💡 Key Takeaways
Response distillation: match the teacher's final output using KL divergence plus a hard-label loss
Loss combines hard and soft: α × hard_loss + (1−α) × soft_loss, typical α = 0.5-0.9
Temperature T=3-5 softens overconfident outputs to reveal more class-relationship information
Feature distillation matches intermediate layers for deeper knowledge transfer
Can distill on unlabeled data: the teacher labels it for free, adding diversity
📌 Interview Tips
1. Explain temperature: T=1 gives a 99.9%-confident prediction; T=5 spreads it to something like 70%/15%/10%/5%, revealing structure
2. Walk through the loss: α=0.7 means 70% weight on ground truth, 30% on matching the teacher
3. Discuss unlabeled data: 1M labeled + 10M teacher-labeled unlabeled examples often beats 1M labeled alone