What is Knowledge Distillation?
THE CORE PROBLEM
Large models achieve high accuracy but are expensive to serve. A 1-billion-parameter model might need 4 GPUs and cost $0.10 per 1,000 requests; a 100-million-parameter model costs $0.01 per 1,000 requests but achieves lower accuracy. Knowledge distillation closes this gap: a small model trained with distillation often retains 95% or more of the large model's accuracy at roughly a tenth of the serving cost.
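The one-time training cost only pays off after enough traffic. A quick sketch of the breakeven arithmetic, using the per-1,000-request prices above and a hypothetical $5,000 distillation training run (the training cost and function name are illustrative, not from any particular deployment):

```python
def breakeven_requests(large_cost_per_1k, small_cost_per_1k, training_cost):
    """Requests needed before the one-time distillation training cost
    is recouped by the cheaper per-request serving cost."""
    savings_per_request = (large_cost_per_1k - small_cost_per_1k) / 1000
    return training_cost / savings_per_request

# $0.10 vs $0.01 per 1,000 requests saves $0.00009 per request,
# so a hypothetical $5,000 training run breaks even after ~55.6M requests.
n = breakeven_requests(0.10, 0.01, 5000)
```

At high request volumes the savings dominate quickly; at low volumes the smaller model may not be worth training at all.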
WHY SOFT LABELS HELP
When classifying an image, a large model (the teacher) might output 80% cat, 15% dog, 5% fox. The hard label says only "cat." The soft distribution additionally tells the student (the smaller model being trained) that the image is somewhat dog-like and slightly fox-like. This extra information helps the student generalize better, especially on ambiguous or edge cases where the relationships between classes matter.
WHEN TO USE DISTILLATION
Use distillation when you have a high-quality teacher available, serving cost matters, and you can afford the one-time training cost. It is most effective when the student is 5 to 20x smaller than the teacher. Below 5x, the savings rarely justify the training run; just use the smaller model directly. Above 20x, the capacity gap is usually too large for the student to absorb the teacher's knowledge.
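The sizing rule above can be encoded as a small helper (a hypothetical function encoding this article's 5-20x rule of thumb, not a universal criterion):

```python
def distillation_recommendation(teacher_params, student_params):
    """Apply the 5-20x compression-ratio rule of thumb from the text."""
    ratio = teacher_params / student_params
    if ratio < 5:
        return "use the smaller model directly"
    if ratio > 20:
        return "gap likely too large to bridge"
    return "distillation recommended"

# A 1B-parameter teacher and 100M-parameter student (10x) fall in range.
verdict = distillation_recommendation(1_000_000_000, 100_000_000)
```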