
What is Knowledge Distillation?

Knowledge Distillation is a model compression technique that transfers knowledge from a large, accurate teacher model into a smaller, faster student model. Instead of training only on hard ground-truth labels (one-hot encoded), the student learns from the teacher's soft probability distribution across all classes. This soft output, often called dark knowledge, reveals which incorrect classes the teacher considers plausible, encoding rich similarity information that improves student learning.

The key mechanism is temperature scaling of the teacher's logits before applying softmax. At temperature T = 1 you get the normal probabilities; at higher temperatures such as T = 5 or T = 10 the distribution becomes softer and more informative. For example, if a teacher classifying animals outputs 0.85 for dog, 0.10 for wolf, and 0.05 for cat, these relative similarities help the student learn that wolves are closer to dogs than cats are.

The student is trained with a combined loss: the usual cross entropy against the hard labels plus a distillation term that matches the teacher's soft distribution using Kullback-Leibler (KL) divergence. This acts as a powerful regularizer, because the soft targets provide more informative gradients per example, especially on rare or ambiguous inputs where a single hard label gives limited signal.

A typical result: Hugging Face distilled BERT base (110 million parameters) into DistilBERT (66 million parameters), retaining about 97 percent of its accuracy while running roughly 60 percent faster at inference. The technique is model agnostic and applies across natural language processing, computer vision, speech recognition, and ranking systems.
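As a concrete illustration, here is a minimal NumPy sketch of temperature-scaled softmax. The teacher logits are hypothetical values chosen to roughly reproduce the dog/wolf/cat probabilities above; they are not taken from any real model.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; higher T gives a softer distribution."""
    scaled = logits / T
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical teacher logits for the classes (dog, wolf, cat).
teacher_logits = np.array([4.0, 1.9, 1.2])

for T in (1, 5, 10):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
# At T=1 the output is sharp (about 0.85 / 0.10 / 0.05); at T=5 or 10 the
# distribution flattens, making the wolf-is-closer-to-dog structure
# much more visible to the student.
```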
💡 Key Takeaways
Soft targets encode class similarities that hard one-hot labels cannot express, providing a richer training signal per example
Temperature parameter controls softness: T = 2 to 20 is typical in production, with higher values producing smoother distributions
Combined loss uses both the ground-truth cross entropy and the KL divergence to the teacher, weighted by alpha and beta hyperparameters (typically around 0.3 and 0.7); see the sketch after this list
DistilBERT example: 40 percent parameter reduction, 60 percent speedup, and 97 percent accuracy retention demonstrate practical compression ratios
Works across domains, including transformers for natural language processing, convolutional neural networks for computer vision, recurrent neural networks for speech recognition, and ranking models
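The sketch below shows how the combined loss from the takeaways could look in PyTorch. The function name, temperature, and alpha/beta values are illustrative assumptions, not settings prescribed by any particular paper or library.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=5.0, alpha=0.3, beta=0.7):
    # Hard-label term: ordinary cross entropy against the ground truth.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Soft-target term: KL divergence between the temperature-scaled
    # student and teacher distributions. The T**2 factor keeps the
    # gradient magnitude comparable across temperatures.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (T ** 2)

    return alpha * ce_loss + beta * kd_loss

# Usage with random tensors standing in for a batch of 8 examples
# over 10 classes:
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In practice the teacher logits come from a frozen forward pass of the teacher model, and only the student's parameters are updated with this loss.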
📌 Examples
Google uses distillation for on-device speech recognition, compressing 100+ MB server models to under 20 MB mobile models while keeping word error rate within a few percent
Meta applies distillation to compress large natural language processing and vision ranking models for mobile applications where graphics processing unit acceleration is unavailable
OpenAI distills large language models through black-box methods, using only input-output pairs to train smaller students for tasks like safety classification at a fraction of the serving cost