Failure Modes: Capacity Mismatch, Bias Amplification, and Distribution Drift
Capacity Mismatch Patterns
When the student is undersized, symptoms include: training loss decreases but validation accuracy plateaus early; student predictions become overconfident and uneven (high accuracy on easy examples, near-random on hard ones); the student learns coarse patterns but misses fine-grained distinctions. Rule of thumb: the student should have 20-50% of the teacher's parameters for classification, 10-30% for language models. Below 10%, expect significant degradation.
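As a concrete illustration, the parameter-ratio rule of thumb can be encoded as a quick pre-distillation check. This is a sketch: the band thresholds come from the heuristic above, and the `flag_capacity_mismatch` helper and its messages are assumptions, not a standard API.

```python
def capacity_ratio(student_params: int, teacher_params: int) -> float:
    """Student/teacher parameter ratio."""
    return student_params / teacher_params

def flag_capacity_mismatch(student_params: int, teacher_params: int,
                           task: str = "classification") -> str:
    """Heuristic bands from the rule of thumb above (assumed thresholds):
    classification 20-50%, language models 10-30%; below 10% expect
    significant degradation regardless of task."""
    lo, hi = (0.20, 0.50) if task == "classification" else (0.10, 0.30)
    r = capacity_ratio(student_params, teacher_params)
    if r < 0.10:
        return "severe: expect significant degradation"
    if r < lo:
        return "undersized: watch for early validation plateau"
    if r > hi:
        return "oversized: distillation gains may be marginal"
    return "within recommended band"
```

Run before committing to a student architecture; parameter counts are cheap to compute, and a "severe" flag here usually predicts the plateau-and-overconfidence pattern described above.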
Bias Amplification
Students can amplify teacher biases. If the teacher is 60% accurate on minority classes versus 95% on majority classes, the student might become 40% versus 93%. This happens because soft labels from confident majority predictions dominate the gradient signal. Mitigation: oversample minority classes in the transfer set, use class-balanced temperature (higher T for minority classes), or add explicit fairness constraints to the loss.
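A minimal sketch of the class-balanced temperature idea: scale each example's distillation temperature by the inverse frequency of its class, so minority-class soft labels are flattened more and carry relatively more gradient signal. The square-root damping and the `per_class_temperature` helper are assumptions for illustration, not a standard recipe.

```python
import numpy as np

def softmax(logits, T: float = 1.0):
    """Numerically stable temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def per_class_temperature(class_counts, base_T: float = 2.0):
    """Assign each class a temperature that grows as the class gets rarer
    (square-root damped so rare classes are not flattened to uniform)."""
    counts = np.asarray(class_counts, dtype=float)
    return base_T * np.sqrt(counts.max() / counts)

def balanced_soft_targets(teacher_logits, labels, class_counts, base_T=2.0):
    """Soften each teacher prediction with the temperature of its class."""
    temps = per_class_temperature(class_counts, base_T)
    return np.stack([softmax(l, temps[y])
                     for l, y in zip(teacher_logits, labels)])
```

With counts [900, 100] and base_T=2, majority-class examples are softened at T=2 and minority-class examples at T=6, so the minority soft targets are flatter and the confident majority predictions dominate the gradient less.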
Distribution Drift
The student is trained on teacher predictions from a fixed data snapshot. When deployed, real data drifts. The student has no mechanism to adapt to new patterns because it never learned from raw labels. Signs: accuracy degrades faster on new data than the teacher's does; confident wrong predictions on novel patterns. Fix: periodic re-distillation from an updated teacher, or hybrid training that includes some hard labels.
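The hybrid objective can be sketched as the standard weighted sum of a distillation term and a hard-label cross-entropy term, in the style of Hinton et al.; the NumPy helpers and default weights here are illustrative, and the distillation term is written as cross-entropy against the teacher's soft targets (which differs from KL divergence only by a constant with respect to the student).

```python
import numpy as np

def log_softmax(logits, T: float = 1.0):
    """Numerically stable temperature-scaled log-softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def hybrid_kd_loss(student_logits, teacher_logits, hard_label: int,
                   T: float = 2.0, alpha: float = 0.7) -> float:
    """alpha weights the distillation term, (1 - alpha) the hard-label
    cross-entropy, so fresh ground-truth labels can correct the student
    when the deployment distribution drifts away from the snapshot."""
    log_p_s = log_softmax(student_logits, T)
    p_t = np.exp(log_softmax(teacher_logits, T))
    # Cross-entropy against teacher soft targets; the T^2 factor keeps the
    # gradient magnitude comparable across temperatures.
    kd = -(p_t * log_p_s).sum() * T * T
    # Plain cross-entropy against the hard label at T = 1.
    ce = -log_softmax(student_logits)[hard_label]
    return float(alpha * kd + (1 - alpha) * ce)
```

Setting `alpha=0.0` recovers ordinary supervised training on hard labels; values around 0.5-0.9 keep most of the teacher's dark knowledge while letting new labeled data pull the student toward the drifted distribution.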