Failure Modes: Capacity Mismatch, Bias Amplification, and Distribution Drift
Capacity Mismatch Patterns
When the student is undersized, symptoms include: training loss decreases but validation accuracy plateaus early; student predictions become overconfident and uneven (high accuracy on easy examples, near-random on hard ones); the student learns coarse patterns but misses fine-grained distinctions. Rule of thumb: the student should have 20-50% of the teacher's parameters for classification, 10-30% for language models. Below 10%, expect significant degradation.
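As a concrete illustration, the parameter-ratio rule of thumb can be encoded as a quick pre-distillation check. This is a sketch: the band thresholds come from the heuristic above, and the `flag_capacity_mismatch` helper and its messages are assumptions, not a standard API.

```python
def capacity_ratio(student_params: int, teacher_params: int) -> float:
    """Student/teacher parameter ratio."""
    return student_params / teacher_params

def flag_capacity_mismatch(student_params: int, teacher_params: int,
                           task: str = "classification") -> str:
    """Heuristic bands from the rule of thumb above (assumed thresholds):
    classification 20-50%, language models 10-30%; below 10% expect
    significant degradation regardless of task."""
    lo, hi = (0.20, 0.50) if task == "classification" else (0.10, 0.30)
    r = capacity_ratio(student_params, teacher_params)
    if r < 0.10:
        return "severe: expect significant degradation"
    if r < lo:
        return "undersized: watch for early validation plateau"
    if r > hi:
        return "oversized: distillation gains may be marginal"
    return "within recommended band"
```

Run before committing to a student architecture; parameter counts are cheap to compute, and a "severe" flag here usually predicts the plateau-and-overconfidence pattern described above.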
Bias Amplification
Students can amplify teacher biases. If the teacher is 60% accurate on minority classes versus 95% on majority classes, the student might become 40% versus 93%. This happens because soft labels from confident majority predictions dominate the gradient signal. Mitigation: oversample minority classes in the transfer set, use class-balanced temperature (higher T for minority classes), or add explicit fairness constraints to the loss.
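A minimal sketch of the class-balanced temperature idea: scale each example's distillation temperature by the inverse frequency of its class, so minority-class soft labels are flattened more and carry relatively more gradient signal. The square-root damping and the `per_class_temperature` helper are assumptions for illustration, not a standard recipe.

```python
import numpy as np

def softmax(logits, T: float = 1.0):
    """Numerically stable temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def per_class_temperature(class_counts, base_T: float = 2.0):
    """Assign each class a temperature that grows as the class gets rarer
    (square-root damped so rare classes are not flattened to uniform)."""
    counts = np.asarray(class_counts, dtype=float)
    return base_T * np.sqrt(counts.max() / counts)

def balanced_soft_targets(teacher_logits, labels, class_counts, base_T=2.0):
    """Soften each teacher prediction with the temperature of its class."""
    temps = per_class_temperature(class_counts, base_T)
    return np.stack([softmax(l, temps[y])
                     for l, y in zip(teacher_logits, labels)])
```

With counts [900, 100] and base_T=2, majority-class examples are softened at T=2 and minority-class examples at T=6, so the minority soft targets are flatter and the confident majority predictions dominate the gradient less.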
Distribution Drift
The student is trained on teacher predictions from a fixed data snapshot. When deployed, real data drifts. The student has no mechanism to adapt to new patterns because it never learned from raw labels. Signs: accuracy degrades faster on new data than the teacher's does; confident wrong predictions on novel patterns. Fix: periodic re-distillation from an updated teacher, or hybrid training that includes some hard labels.
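The hybrid objective can be sketched as the standard weighted sum of a distillation term and a hard-label cross-entropy term, in the style of Hinton et al.; the NumPy helpers and default weights here are illustrative, and the distillation term is written as cross-entropy against the teacher's soft targets (which differs from KL divergence only by a constant with respect to the student).

```python
import numpy as np

def log_softmax(logits, T: float = 1.0):
    """Numerically stable temperature-scaled log-softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def hybrid_kd_loss(student_logits, teacher_logits, hard_label: int,
                   T: float = 2.0, alpha: float = 0.7) -> float:
    """alpha weights the distillation term, (1 - alpha) the hard-label
    cross-entropy, so fresh ground-truth labels can correct the student
    when the deployment distribution drifts away from the snapshot."""
    log_p_s = log_softmax(student_logits, T)
    p_t = np.exp(log_softmax(teacher_logits, T))
    # Cross-entropy against teacher soft targets; the T^2 factor keeps the
    # gradient magnitude comparable across temperatures.
    kd = -(p_t * log_p_s).sum() * T * T
    # Plain cross-entropy against the hard label at T = 1.
    ce = -log_softmax(student_logits)[hard_label]
    return float(alpha * kd + (1 - alpha) * ce)
```

Setting `alpha=0.0` recovers ordinary supervised training on hard labels; values around 0.5-0.9 keep most of the teacher's dark knowledge while letting new labeled data pull the student toward the drifted distribution.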