Three Transfer Granularities: Response, Feature, and Relation Based Distillation
RESPONSE BASED DISTILLATION
The most common approach: train the student to match the teacher's final-layer output. For each training input, run both teacher and student, then minimize the distance between their output distributions. The loss typically combines cross entropy with the ground-truth labels and KL divergence (a measure of how different two probability distributions are) with the teacher's outputs: loss = α × hard_loss + (1-α) × soft_loss. Typical α is 0.5 to 0.9.
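A minimal PyTorch sketch of this combined loss. The function name and defaults are illustrative; the T² factor on the soft term is the standard scaling that keeps gradient magnitudes comparable across temperatures:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, T=4.0):
    # Hard loss: cross entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between temperature-softened distributions.
    # Scale by T*T so gradient magnitudes stay comparable as T changes.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard_loss + (1 - alpha) * soft_loss
```

With α = 0.7 the student leans on the ground-truth labels but still absorbs the teacher's class-similarity structure through the soft term.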
TEMPERATURE SCALING
Teacher outputs are often too confident: 99.9% for the correct class, near zero for the rest. This hides information about class relationships. Temperature scaling softens the distribution: divide the logits (the raw values fed into softmax, which converts them to probabilities) by a temperature T before applying softmax. T=1 is standard softmax; T=5 spreads probability more evenly. Higher temperature reveals more of the teacher's knowledge but may also transfer noise. T=3 to T=5 works well for most tasks.
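The effect is easy to see numerically. This small NumPy sketch (illustrative logits) applies the divide-then-softmax step:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide logits by T, then apply a numerically stable softmax.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 2.0, 1.0]
sharp = softmax_with_temperature(logits, T=1.0)  # winner takes nearly all mass
soft = softmax_with_temperature(logits, T=5.0)   # secondary classes become visible
```

At T=1 the top class absorbs almost all the probability; at T=5 the runner-up classes carry enough mass for the student to learn how the teacher ranks them.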
FEATURE BASED DISTILLATION
Instead of matching only final outputs, match intermediate representations: force the student's hidden layers to resemble the teacher's. This transfers the teacher's internal structure, not just its predictions. It works best when teacher and student have similar architectures, and requires a projection layer if their hidden dimensions differ.
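A sketch of the projection-plus-matching step, assuming hypothetical hidden sizes (student 256, teacher 768) and an MSE feature loss, one common choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Match a student hidden layer to a teacher hidden layer of a different width."""

    def __init__(self, student_dim=256, teacher_dim=768):
        super().__init__()
        # Learned linear projection maps student features into the teacher's space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats, teacher_feats):
        # Detach teacher features: gradients flow only into the student and projection.
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())
```

This term is usually added to the response-based loss with its own weight, so the student matches both the teacher's predictions and its intermediate features.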
TRAINING DATA
You can distill on the original training data or on unlabeled data with teacher-generated labels. Unlabeled data often improves results because it provides more diverse examples, and the teacher effectively labels this extra data for free.
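The labeling step can be sketched as a single pass of the frozen teacher over unlabeled batches, storing softened targets for later student training. Function and variable names here are illustrative:

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, unlabeled_batches, T=4.0):
    # Run the frozen teacher over unlabeled inputs and keep its softened
    # output distributions as training targets for the student.
    teacher.eval()
    labeled = []
    for x in unlabeled_batches:
        soft_targets = torch.softmax(teacher(x) / T, dim=-1)
        labeled.append((x, soft_targets))
    return labeled
```

The student then trains on these (input, soft target) pairs with only the soft-loss term, since no ground-truth labels exist for this data.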