ML Model Optimization • Knowledge Distillation • Medium • ⏱️ ~3 min
Three Transfer Granularities: Response-, Feature-, and Relation-Based Distillation
Knowledge distillation operates at three distinct granularities, each capturing a different aspect of teacher knowledge. Response-based distillation matches the final output distribution, typically by minimizing the KL divergence between the temperature-softened softmax outputs of teacher and student. This is the simplest approach and works with black-box teacher access, making it practical when you can only query an API (application programming interface) without internal access. For a 1,000-class image classifier, you can transfer the full softmax output, or keep only the top 50 probabilities per example to save storage when distilling over 100 million examples.
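As a rough PyTorch sketch (not from the text), the response-based objective is usually a KL divergence between temperature-softened teacher and student distributions mixed with the standard hard-label loss; the temperature and mixing weight below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=4.0, alpha=0.5):
    """Soft-target KL term plus hard-label cross-entropy (hyperparameters are illustrative)."""
    # Soften both distributions; kl_div expects log-probabilities as its first argument.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps soft-target gradients on the same scale as the hard-label term.
    kd_term = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```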
Feature-based distillation matches intermediate-layer activations or attention maps. This becomes critical when teacher and student architectures differ significantly in depth or inductive bias. For example, distilling a 12-layer BERT teacher into a 6-layer student benefits from matching hidden states at corresponding layers, not just final outputs. You typically add projection heads to align dimensions, then minimize a cosine-distance or L2 loss between teacher and student feature maps. Meta has reported using feature matching when compressing vision models from residual networks to efficient mobile architectures, preserving representation quality that pure output matching would lose.
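One way to implement this, sketched below under assumed dimensions (the 768-wide teacher and 512-wide student are hypothetical), is a small projection head on the student followed by a cosine or L2 loss against the teacher's hidden states.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationHead(nn.Module):
    """Projects student hidden states to the teacher width, then matches them."""

    def __init__(self, student_dim=512, teacher_dim=768, use_cosine=True):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # aligns mismatched widths
        self.use_cosine = use_cosine

    def forward(self, student_hidden, teacher_hidden):
        projected = self.proj(student_hidden)            # [batch, seq_len, teacher_dim]
        if self.use_cosine:
            # 1 - cosine similarity, averaged over every token position in the batch.
            return (1.0 - F.cosine_similarity(projected, teacher_hidden, dim=-1)).mean()
        # Plain L2 (mean-squared error) alternative on the aligned feature maps.
        return F.mse_loss(projected, teacher_hidden)
```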
Relation-based distillation preserves pairwise or higher-order relationships across samples, such as distances in embedding space or similarity matrices. This matters for retrieval and ranking tasks, where global structure is as important as individual predictions. For a text embedding model serving 10,000 queries per second, you might compute pairwise cosine similarities for batches of 64 examples from the teacher, then train the student to match this 64-by-64 similarity matrix. This preserves ranking quality better than matching outputs independently. The tradeoff is computational cost: relation-based methods scale quadratically with batch size and require careful sampling strategies in production pipelines.
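A minimal sketch of the similarity-matrix matching described above, assuming L2-normalized embeddings and a mean-squared (Frobenius-style) penalty; the batch comes straight from the data loader, and the cost grows quadratically with its size.

```python
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_emb, teacher_emb):
    """Match batch-wise cosine-similarity matrices, e.g. 64 x 64 for a batch of 64.

    student_emb and teacher_emb are [batch, dim] tensors; their dims may differ,
    since only the relations between samples are compared, not the embeddings themselves.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_student = s @ s.t()   # [batch, batch] pairwise cosine similarities
    sim_teacher = t @ t.t()
    # Mean-squared (scaled Frobenius-norm) matching of the two similarity matrices.
    return F.mse_loss(sim_student, sim_teacher)
```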
💡 Key Takeaways
• Response-based distillation uses only input-output pairs and works with API access; storing the top 50 probabilities per example instead of the full 1,000-class distribution saves 95 percent of the storage
• Feature-based distillation requires internal access but preserves representation quality when architectures differ, using a cosine or L2 loss between aligned hidden states
• Relation-based distillation for ranking tasks trains on pairwise similarities across batches, scaling quadratically with batch size but preserving global structure
• Combining granularities often works best: response-based for final outputs plus feature-based for key intermediate layers captures both local and global knowledge (see the combined-loss sketch after this list)
• Black-box distillation limits you to response-based transfer only, losing 10 to 15 percent of the potential compression benefit compared to white-box feature access
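As a rough illustration of the combining point above, the three granularities can be weighted into a single training loss; the sketch below assumes the helper functions defined earlier are in scope, and the dictionary keys and weights are placeholder assumptions, not values from the text.

```python
def combined_distillation_loss(student_out, teacher_out, labels, feature_head,
                               w_response=1.0, w_feature=0.5, w_relation=0.5):
    """Weighted sum of the three granularities (weights are placeholder assumptions).

    student_out / teacher_out are assumed to be dicts holding "logits", "hidden",
    and "embedding" tensors; feature_head is a FeatureDistillationHead instance.
    """
    loss = w_response * response_distillation_loss(
        student_out["logits"], teacher_out["logits"], labels)
    loss = loss + w_feature * feature_head(
        student_out["hidden"], teacher_out["hidden"])
    loss = loss + w_relation * relation_distillation_loss(
        student_out["embedding"], teacher_out["embedding"])
    return loss
```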
📌 Examples
Distilling 12-layer BERT to 6-layer DistilBERT: match outputs plus hidden states at teacher layers 4, 8, 12 mapped to student layers 2, 4, 6 using a cosine similarity loss
Text embedding model serving 10,000 queries per second: compute 64-by-64 similarity matrices over batches and train the student to match them with a Frobenius norm loss, improving Mean Average Precision by 8 percent over response-only distillation
Apple compressing a server speech model for on-device use: feature distillation from a recurrent neural network teacher to a compact transformer student preserves acoustic representation quality under a 20 MB size constraint