Validation and Monitoring: Beyond Accuracy to Calibration and Drift
Why Calibration Degrades
Soft labels at high temperature compress the probability range. A teacher confident at 0.95 becomes 0.65 after softening. The student learns to output values in this compressed range. At inference (T=1), predictions cluster around 0.6-0.8 rather than spanning 0.1-0.99. Result: the student is overconfident on uncertain examples and underconfident on clear ones. This matters for downstream decisions: a recommendation system using "show if confidence > 0.7" behaves incorrectly.
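The compression effect is easy to verify numerically. This is a minimal sketch for the binary case: it recovers the logit behind a teacher probability and re-softens it at temperature T (the function name and the choice T=4 are illustrative, not from the original).

```python
import math

def softened_confidence(p, T):
    """Binary case: probability p at T=1, re-expressed at temperature T.

    Recovers the logit behind p, divides it by T, and maps it back
    through the sigmoid -- the 1-D analogue of temperature-scaled softmax.
    """
    z = math.log(p / (1 - p))          # logit corresponding to p
    return 1 / (1 + math.exp(-z / T))  # sigmoid of the softened logit

# A teacher at 0.95 lands in the high-0.6s at T=4, matching the
# compressed range described above.
print(softened_confidence(0.95, 4))
```

A student trained to reproduce these softened targets learns outputs in this narrow band, which is exactly the range collapse the paragraph describes.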
Monitoring Beyond Accuracy
Track these metrics alongside accuracy:

- Expected Calibration Error (ECE): bin predictions by confidence and measure the gap between average confidence and accuracy in each bin. An ECE below 0.05 is well-calibrated.
- Brier score: mean squared error of the probability predictions. Lower is better; compare against the teacher's score as a baseline.
- Agreement rate: how often student and teacher predictions match. Below 90% suggests capacity or training issues.
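The first two metrics above can be sketched in a few lines of NumPy. This is a minimal implementation under the standard definitions (equal-width confidence bins for ECE); the function names are my own, not from the original.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| per bin, weighted by bin size.

    confidences: predicted probability of the predicted class, shape (n,)
    correct:     1.0 where the prediction was right, else 0.0, shape (n,)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))
```

A perfectly calibrated model (e.g. 80% confidence with 80% accuracy in every bin) scores an ECE of zero; the overconfident student described earlier shows up as bins where confidence sits well above accuracy.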
Calibration Recovery
Post-hoc calibration fixes the problem without retraining:

- Temperature scaling: learn a single scalar T on validation data that minimizes calibration error, then divide the logits by T before the softmax at inference.
- Platt scaling: fit a logistic regression on validation-set predictions.

Both add negligible latency (one multiply or a small linear layer). Always evaluate calibration on a held-out set separate from the validation set used to fit the scaling.
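Temperature scaling is simple enough to sketch end to end. This version fits T by grid search over negative log-likelihood on validation logits; a grid is an assumption for clarity (gradient-based fits are common too), and the function names are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize against overflow
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar T minimizing NLL on held-out validation logits."""
    n = len(val_labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.log(probs[np.arange(n), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

At inference the fitted T divides the logits before the softmax, as described above; the argmax (and hence accuracy) is unchanged, only the confidence values move. As the text notes, report final calibration numbers on a third split, not on the data used to choose T.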