ML Model Optimization • Knowledge Distillation • Medium • ⏱️ ~3 min
Three Transfer Granularities: Response-, Feature-, and Relation-Based Distillation
Knowledge distillation operates at three distinct granularities, each capturing a different aspect of teacher knowledge. Response-based distillation matches the final output distribution, typically by minimizing the KL divergence between the temperature-softened softmax outputs of teacher and student. This is the simplest approach and works with black-box teacher access, making it practical when you can only query an API (application programming interface) without internal access. For a 1,000-class image classifier, you can transfer the full softmax output, or keep only the top 50 probabilities per example to save storage when distilling over 100 million examples.
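As a rough PyTorch sketch (not from the text), the response-based objective is usually a KL divergence between temperature-softened teacher and student distributions mixed with the standard hard-label loss; the temperature and mixing weight below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=4.0, alpha=0.5):
    """Soft-target KL term plus hard-label cross-entropy (hyperparameters are illustrative)."""
    # Soften both distributions; kl_div expects log-probabilities as its first argument.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps soft-target gradients on the same scale as the hard-label term.
    kd_term = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```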
Feature-based distillation matches intermediate-layer activations or attention maps. This becomes critical when teacher and student architectures differ significantly in depth or inductive bias. For example, distilling a 12-layer BERT teacher into a 6-layer student benefits from matching hidden states at corresponding layers, not just final outputs. You typically add projection heads to align dimensions, then minimize a cosine-distance or L2 loss between teacher and student feature maps. Meta has reported using feature matching when compressing vision models from residual networks to efficient mobile architectures, preserving representation quality that pure output matching would lose.
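One way to implement this, sketched below under assumed dimensions (the 768-wide teacher and 512-wide student are hypothetical), is a small projection head on the student followed by a cosine or L2 loss against the teacher's hidden states.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationHead(nn.Module):
    """Projects student hidden states to the teacher width, then matches them."""

    def __init__(self, student_dim=512, teacher_dim=768, use_cosine=True):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # aligns mismatched widths
        self.use_cosine = use_cosine

    def forward(self, student_hidden, teacher_hidden):
        projected = self.proj(student_hidden)            # [batch, seq_len, teacher_dim]
        if self.use_cosine:
            # 1 - cosine similarity, averaged over every token position in the batch.
            return (1.0 - F.cosine_similarity(projected, teacher_hidden, dim=-1)).mean()
        # Plain L2 (mean-squared error) alternative on the aligned feature maps.
        return F.mse_loss(projected, teacher_hidden)
```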
Relation-based distillation preserves pairwise or higher-order relationships across samples, such as distances in embedding space or similarity matrices. This matters for retrieval and ranking tasks, where global structure is as important as individual predictions. For a text embedding model serving 10,000 queries per second, you might compute pairwise cosine similarities for batches of 64 examples from the teacher, then train the student to match this 64-by-64 similarity matrix. This preserves ranking quality better than matching outputs independently. The tradeoff is computational cost: relation-based methods scale quadratically with batch size and require careful sampling strategies in production pipelines.
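A minimal sketch of the similarity-matrix matching described above, assuming L2-normalized embeddings and a mean-squared (Frobenius-style) penalty; the batch comes straight from the data loader, and the cost grows quadratically with its size.

```python
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_emb, teacher_emb):
    """Match batch-wise cosine-similarity matrices, e.g. 64 x 64 for a batch of 64.

    student_emb and teacher_emb are [batch, dim] tensors; their dims may differ,
    since only the relations between samples are compared, not the embeddings themselves.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_student = s @ s.t()   # [batch, batch] pairwise cosine similarities
    sim_teacher = t @ t.t()
    # Mean-squared (scaled Frobenius-norm) matching of the two similarity matrices.
    return F.mse_loss(sim_student, sim_teacher)
```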
💡 Key Takeaways
• Response-based distillation uses only input-output pairs and works with API access; storing the top 50 probabilities per example instead of the full 1,000-class distribution saves 95 percent of the storage
• Feature-based distillation requires internal access but preserves representation quality when architectures differ, using a cosine or L2 loss between aligned hidden states
• Relation-based distillation for ranking tasks trains on pairwise similarities across batches, scaling quadratically with batch size but preserving global structure
• Combining granularities often works best: response-based for final outputs plus feature-based for key intermediate layers captures both local and global knowledge (see the combined-loss sketch after this list)
• Black-box distillation limits you to response-based transfer only, losing 10 to 15 percent of the potential compression benefit compared to white-box feature access
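As a rough illustration of the combining point above, the three granularities can be weighted into a single training loss; the sketch below assumes the helper functions defined earlier are in scope, and the dictionary keys and weights are placeholder assumptions, not values from the text.

```python
def combined_distillation_loss(student_out, teacher_out, labels, feature_head,
                               w_response=1.0, w_feature=0.5, w_relation=0.5):
    """Weighted sum of the three granularities (weights are placeholder assumptions).

    student_out / teacher_out are assumed to be dicts holding "logits", "hidden",
    and "embedding" tensors; feature_head is a FeatureDistillationHead instance.
    """
    loss = w_response * response_distillation_loss(
        student_out["logits"], teacher_out["logits"], labels)
    loss = loss + w_feature * feature_head(
        student_out["hidden"], teacher_out["hidden"])
    loss = loss + w_relation * relation_distillation_loss(
        student_out["embedding"], teacher_out["embedding"])
    return loss
```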
📌 Examples
Distilling 12-layer BERT to 6-layer DistilBERT: match outputs plus hidden states at teacher layers 4, 8, 12 mapped to student layers 2, 4, 6 using a cosine similarity loss
Text embedding model serving 10,000 queries per second: compute 64-by-64 similarity matrices over batches and train the student to match them with a Frobenius norm loss, improving Mean Average Precision by 8 percent over response-only distillation
Apple compressing a server speech model for on-device use: feature distillation from a recurrent neural network teacher to a compact transformer student preserves acoustic representation quality under a 20 MB size constraint