
Failure Modes: Capacity Mismatch, Bias Amplification, and Distribution Drift

Key Insight
Capacity mismatch is the most common distillation failure. A student that is too small cannot represent the teacher's knowledge; one that is too large overfits to teacher errors rather than generalizing.

Capacity Mismatch Patterns

When the student is undersized, symptoms include: training loss decreases but validation accuracy plateaus early; student predictions become overconfident (high accuracy on easy examples, random on hard ones); the student learns coarse patterns but misses fine-grained distinctions. Rule of thumb: student should be 20-50% of teacher parameters for classification, 10-30% for language models. Below 10%, expect significant degradation.
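The rules of thumb above can be turned into a quick sizing check before training. This is an illustrative sketch; the function name, return strings, and task labels are assumptions, and the ratios are heuristics, not hard rules.

```python
def check_capacity_ratio(student_params: int, teacher_params: int,
                         task: str = "classification") -> str:
    """Flag likely capacity mismatch using parameter-count heuristics.

    Recommended student/teacher ratios: 20-50% for classification,
    10-30% for language models; below 10%, expect significant degradation.
    """
    ratio = student_params / teacher_params
    low, high = (0.20, 0.50) if task == "classification" else (0.10, 0.30)
    if ratio < 0.10:
        return f"ratio {ratio:.0%}: below 10%, expect significant degradation"
    if ratio < low:
        return f"ratio {ratio:.0%}: undersized for {task}, watch for early plateau"
    if ratio > high:
        return f"ratio {ratio:.0%}: oversized, risk of overfitting to teacher errors"
    return f"ratio {ratio:.0%}: within the {low:.0%}-{high:.0%} heuristic for {task}"

# Example: a 110M-parameter student distilled from a 1.1B-parameter teacher LLM
print(check_capacity_ratio(110_000_000, 1_100_000_000, task="lm"))
```

A check like this is cheap insurance: it catches the "below 10%" regime before any GPU time is spent.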

Bias Amplification

Students can amplify teacher biases. If the teacher is 60% accurate on minority classes versus 95% on majority classes, the student might become 40% versus 93%. This happens because soft labels from confident majority predictions dominate the gradient signal. Mitigation: oversample minority classes in the transfer set, use class-balanced temperature (higher T for minority classes), or add explicit fairness constraints to the loss.
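The class-balanced temperature mitigation can be sketched as follows: each example's soft targets are computed with a temperature looked up by its true class, so minority classes get higher T and softer (less dominant) teacher labels. Function and variable names here are assumptions for illustration, not a standard API.

```python
import numpy as np

def soft_targets_class_balanced(teacher_logits: np.ndarray,
                                labels: np.ndarray,
                                class_T: np.ndarray) -> np.ndarray:
    """Soften teacher logits with a per-class temperature.

    class_T holds one temperature per class; assigning minority classes a
    higher T flattens their soft labels, so confident majority predictions
    dominate the gradient signal less.
    """
    T = class_T[labels][:, None]            # temperature per example, by true class
    z = teacher_logits / T
    z -= z.max(axis=1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Majority class 0 uses T=2; minority class 1 uses T=4 (softer targets)
logits = np.array([[4.0, 0.0], [0.5, 3.5]])
labels = np.array([0, 1])
targets = soft_targets_class_balanced(logits, labels, np.array([2.0, 4.0]))
```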

Distribution Drift

The student is trained on teacher predictions from a fixed data snapshot. When deployed, real data drifts. The student has no mechanism to update from new patterns because it never learned from raw labels. Signs: accuracy degrades faster than the teacher would on new data; confident wrong predictions on novel patterns. Fix: periodic re-distillation from updated teacher, or hybrid training that includes some hard labels.
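The hybrid-training fix can be expressed as a blended loss: a weighted sum of cross-entropy against the teacher's soft targets and cross-entropy against the hard ground-truth labels. This is a minimal sketch; the names and the 0.7 default weight are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np

def hybrid_distillation_loss(student_log_probs: np.ndarray,
                             teacher_probs: np.ndarray,
                             hard_labels: np.ndarray,
                             alpha: float = 0.7) -> float:
    """Blend soft teacher targets with hard ground-truth labels.

    Keeping a (1 - alpha) share of hard-label signal gives the student a
    path to learn patterns absent from the teacher's training snapshot.
    """
    # Cross-entropy against the teacher's soft distribution
    soft_loss = -(teacher_probs * student_log_probs).sum(axis=1).mean()
    # Standard cross-entropy against the true labels
    hard_loss = -student_log_probs[np.arange(len(hard_labels)), hard_labels].mean()
    return float(alpha * soft_loss + (1 - alpha) * hard_loss)

# Toy batch of two examples over two classes
log_probs = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
teacher = np.array([[0.9, 0.1], [0.1, 0.9]])
labels = np.array([0, 1])
loss = hybrid_distillation_loss(log_probs, teacher, labels, alpha=0.7)
```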

💡 Detection: Monitor student-teacher agreement over time. Divergence above 5% on held-out data indicates drift or capacity issues requiring intervention.
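A minimal agreement monitor for this detection rule might look like the sketch below; the function names and the 95% default threshold (i.e. 5% divergence) are taken from the tip above, everything else is an assumption.

```python
import numpy as np

def student_teacher_agreement(student_preds, teacher_preds) -> float:
    """Fraction of held-out examples where student and teacher agree."""
    student_preds = np.asarray(student_preds)
    teacher_preds = np.asarray(teacher_preds)
    return float((student_preds == teacher_preds).mean())

def needs_intervention(agreement: float, threshold: float = 0.95) -> bool:
    """Divergence above 5% (agreement below 95%) flags drift or capacity issues."""
    return agreement < threshold
```

Run this on a fixed held-out set at a regular cadence; a downward trend in agreement is the signal, even before the threshold trips.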
💡 Key Takeaways
- Capacity mismatch: student 20-50% of teacher params for classification, 10-30% for LLMs; below 10%, expect degradation
- Undersized students show early plateau, overconfidence on easy examples, and miss fine-grained distinctions
- Bias amplification: students can worsen teacher biases on minority classes due to gradient imbalance
- Distribution drift: students trained on fixed snapshots degrade faster than teachers on novel data
- Monitor student-teacher agreement; divergence above 5% signals capacity or drift problems
📌 Interview Tips
1. Discuss capacity ratios (20-50% for classification) when sizing student models - shows practical experience
2. Mention bias amplification risk and mitigations (oversampling, class-balanced temperature) for fairness questions
3. Explain distribution drift monitoring - student-teacher agreement tracking demonstrates production awareness