
Validation and Monitoring: Beyond Accuracy to Calibration and Drift

Definition
Calibration measures whether predicted probabilities match actual outcomes. A calibrated model predicting 70% confidence should be correct 70% of the time. Distilled students often lose calibration even when accuracy is preserved.

Why Calibration Degrades

Soft labels at high temperature compress the probability range. A teacher confident at 0.95 becomes 0.65 after softening. The student learns to output values in this compressed range. At inference (T=1), predictions cluster around 0.6-0.8 rather than spanning 0.1-0.99. Result: the student is overconfident on uncertain examples and underconfident on clear ones. This matters for downstream decisions: a recommendation system using "show if confidence > 0.7" behaves incorrectly.
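The compression effect is easy to see numerically. The sketch below (using hypothetical logits, not taken from the text) applies a temperature-scaled softmax to the same logits at T=1 and T=4; the teacher's near-certain prediction collapses into the mid-range that the student then learns to imitate.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for a 3-class example the teacher is sure about.
logits = np.array([5.0, 1.0, 0.5])

p_hard = softmax(logits, T=1.0)  # confident: top probability near 0.97
p_soft = softmax(logits, T=4.0)  # softened: top probability compressed below 0.6

print(p_hard.round(3), p_soft.round(3))
```

A student trained to match `p_soft` learns outputs in this compressed range, which is exactly why its inference-time (T=1) probabilities never span the full 0.1-0.99 interval.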

Monitoring Beyond Accuracy

Track these metrics alongside accuracy:

- Expected Calibration Error (ECE): bin predictions by confidence, then measure the gap between average confidence and accuracy in each bin. ECE below 0.05 indicates a well-calibrated model.
- Brier score: the mean squared error of probability predictions. Lower is better; compare against the teacher's score.
- Agreement rate: how often student and teacher predictions match. Below 90% suggests a capacity or training issue.
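ECE is straightforward to compute from a validation set. A minimal sketch (the toy confidence/hit arrays below are illustrative, not from the text):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, take the |accuracy - confidence|
    gap per bin, and average the gaps weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

# Toy check: 80% accuracy at 90% stated confidence -> 0.1 gap in one bin.
conf = np.array([0.9, 0.9, 0.9, 0.9, 0.9])
hit = np.array([1, 1, 1, 1, 0])
print(expected_calibration_error(conf, hit))
```

By the 0.05 rule of thumb above, this toy model (ECE of 0.1) would fail the calibration check.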

Calibration Recovery

Post-hoc calibration fixes the problem without retraining:

- Temperature scaling: learn a single T value on validation data that minimizes calibration error, then apply T to the logits before softmax at inference.
- Platt scaling: fit a logistic regression on the validation predictions.

Both add negligible latency (one multiply or a small linear layer). Always evaluate calibration on a held-out set separate from the scaling validation set.
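Temperature scaling can be implemented in a few lines. This sketch fits T by grid search over validation negative log-likelihood (a common choice; the original text says only "minimizes calibration error"), on randomly generated stand-in logits and labels:

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of labels under a temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Learn the single scalar T that minimizes validation NLL (grid search)."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Hypothetical overconfident student: logit magnitudes are exaggerated,
# and labels are random, so a large T (softer probabilities) fits better.
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(200, 3)) * 6.0
val_labels = rng.integers(0, 3, size=200)
T = fit_temperature(val_logits, val_labels)
print(T)  # T > 1 softens the overconfident predictions
```

In production, the learned T is baked into the final softmax, so the only runtime cost is the single division noted above.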

✅ Production Check: Run calibration analysis before deploying any distilled model. Temperature scaling takes minutes and prevents confidence-based decision failures.
💡 Key Takeaways
Calibration measures if predicted probabilities match actual outcomes; distilled students often lose it
High-temperature training compresses probability range, causing over/underconfidence at inference
Track ECE (below 0.05 is good), Brier score, and student-teacher agreement rate (above 90%)
Temperature scaling and Platt scaling recover calibration post-hoc with negligible latency cost
Always validate calibration on held-out data separate from the scaling validation set
📌 Interview Tips
1. Explain why calibration degrades (probability compression from temperature) - shows deep understanding
2. Mention the ECE threshold of 0.05 and how to compute it - specific metrics impress interviewers
3. Describe temperature scaling as a post-hoc fix that takes minutes - practical production knowledge