
What is Knowledge Distillation?

Definition
Knowledge Distillation trains a small, fast model (the student) to mimic the behavior of a large, accurate model (the teacher). The student learns not just the correct answers but the teacher's probability distribution over all possible answers, capturing richer information than hard labels alone.
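The standard way to train a student against a teacher's distribution is a weighted sum of the usual hard-label cross-entropy and a soft-label cross-entropy against the teacher's temperature-softened outputs. Below is a minimal NumPy sketch; the temperature `T=4.0` and weight `alpha=0.5` are illustrative choices, not values from the text.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted sum of hard-label CE and soft-label CE against the teacher."""
    # Hard term: ordinary cross-entropy against the ground-truth label
    student_probs = softmax(student_logits)
    hard_ce = -np.log(student_probs[hard_label])
    # Soft term: cross-entropy between softened teacher and softened student
    soft_teacher = softmax(teacher_logits, T)
    log_soft_student = np.log(softmax(student_logits, T))
    soft_ce = -np.sum(soft_teacher * log_soft_student)
    # T**2 keeps the soft term's gradient magnitude comparable to the hard term
    return alpha * hard_ce + (1 - alpha) * (T ** 2) * soft_ce
```

A student whose logits already match the teacher's incurs a lower loss than one that disagrees, which is exactly the pressure that makes it mimic the teacher.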

THE CORE PROBLEM

Large models achieve high accuracy but are expensive to serve. A 1-billion-parameter model might need 4 GPUs and cost $0.10 per 1,000 requests. A 100-million-parameter model costs $0.01 per 1,000 requests but achieves lower accuracy. Knowledge distillation closes this gap: the small model trained with distillation often matches 95% or more of the large model's accuracy at 10x lower cost.
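Those per-request figures make a quick break-even calculation possible: distillation pays off once the serving savings exceed the one-time training cost. The training cost below is an illustrative assumption, not a figure from the text.

```python
# Back-of-envelope break-even using the serving costs quoted above.
teacher_cost_per_1k = 0.10   # large model: $ per 1,000 requests
student_cost_per_1k = 0.01   # distilled model: $ per 1,000 requests
training_cost = 5_000.0      # ASSUMED one-time distillation cost ($)

savings_per_1k = teacher_cost_per_1k - student_cost_per_1k   # $0.09 per 1,000
break_even_requests = training_cost / savings_per_1k * 1_000
print(f"Break-even at ~{break_even_requests:,.0f} requests")
```

Under these assumptions the student pays for itself after roughly 56 million requests; every request beyond that is pure savings.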

WHY SOFT LABELS HELP

When classifying an image, a large model might output: 80% cat, 15% dog, 5% fox. The hard label just says cat. But the soft distribution tells the student that this image is somewhat dog-like and slightly fox-like. This extra information helps the student generalize better, especially on ambiguous or edge cases where the relationships between classes matter.
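The cat/dog/fox example above can be made concrete. In practice the teacher's distribution is usually softened with a temperature before the student learns from it, which amplifies the small probabilities that encode class relationships. A short sketch, with logits chosen to reproduce the 80/15/5 split:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: T > 1 flattens the distribution
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative teacher logits for [cat, dog, fox] chosen to give 80/15/5
logits = np.log(np.array([0.80, 0.15, 0.05]))

print(np.round(softmax(logits, T=1.0), 2))  # [0.8  0.15 0.05]
print(np.round(softmax(logits, T=4.0), 2))  # [0.46 0.3  0.23]
```

At T=4 the "dog-like" and "fox-like" signals are much more prominent, so the student receives a clearer teaching signal about which classes resemble each other.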

💡 Key Insight: The teacher output contains more information than binary labels. A confident 99% prediction means something different than an uncertain 51% prediction, even if both are correct.

WHEN TO USE DISTILLATION

Use distillation when you have a high-quality teacher available, serving cost matters, and you can afford the one-time training cost. It is most effective when the student is 5 to 20x smaller than the teacher. Below 5x, just use the smaller model directly. Above 20x, the gap is too large to bridge.
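That sweet-spot rule is easy to encode as a quick sanity check on parameter counts. A small helper (the function name and return strings are my own, not from the text):

```python
def distillation_fit(teacher_params: float, student_params: float) -> str:
    """Apply the 5-20x rule of thumb to a teacher/student size ratio."""
    ratio = teacher_params / student_params
    if ratio < 5:
        return "use the smaller model directly"   # gap too small to bother
    if ratio > 20:
        return "gap too large to bridge"          # student lacks capacity
    return "good candidate for distillation"

# Example: 1B-parameter teacher, 100M-parameter student (ratio 10x)
print(distillation_fit(1e9, 1e8))  # good candidate for distillation
```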

💡 Key Takeaways
- Student learns to mimic the teacher's probability distribution, not just hard labels
- Soft labels contain richer information: 80% cat, 15% dog tells more than just "cat"
- Closes the cost-accuracy gap: 95%+ of large-model accuracy at 10x lower serving cost
- Most effective when the student is 5-20x smaller than the teacher
- Requires a one-time training cost but saves on every subsequent inference
📌 Interview Tips
1. Explain soft labels: 80% cat, 15% dog, 5% fox teaches relationships between classes
2. Discuss cost savings: 1B-param model at $0.10/1K vs 100M at $0.01/1K after distillation
3. Describe the sweet spot: student 5-20x smaller; below 5x just train smaller, above 20x gap too large