Validation and Monitoring: Beyond Accuracy to Calibration and Drift
Why Calibration Degrades
Soft labels at high temperature compress the probability range. A teacher confident at 0.95 becomes 0.65 after softening. The student learns to output values in this compressed range. At inference (T=1), predictions cluster around 0.6-0.8 rather than spanning 0.1-0.99. Result: the student is overconfident on uncertain examples and underconfident on clear ones. This matters for downstream decisions: a recommendation system using "show if confidence > 0.7" behaves incorrectly.
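The compression effect is easy to verify numerically. This is a minimal sketch for the binary case: it recovers the logit behind a teacher probability and re-softens it at temperature T (the function name and the choice T=4 are illustrative, not from the original).

```python
import math

def softened_confidence(p, T):
    """Binary case: probability p at T=1, re-expressed at temperature T.

    Recovers the logit behind p, divides it by T, and maps it back
    through the sigmoid -- the 1-D analogue of temperature-scaled softmax.
    """
    z = math.log(p / (1 - p))          # logit corresponding to p
    return 1 / (1 + math.exp(-z / T))  # sigmoid of the softened logit

# A teacher at 0.95 lands in the high-0.6s at T=4, matching the
# compressed range described above.
print(softened_confidence(0.95, 4))
```

A student trained to reproduce these softened targets learns outputs in this narrow band, which is exactly the range collapse the paragraph describes.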
Monitoring Beyond Accuracy
Track these metrics alongside accuracy:

- Expected Calibration Error (ECE): bin predictions by confidence and measure the gap between average confidence and accuracy in each bin. An ECE below 0.05 is well-calibrated.
- Brier score: mean squared error of the probability predictions. Lower is better; compare against the teacher's score as a baseline.
- Agreement rate: how often student and teacher predictions match. Below 90% suggests capacity or training issues.
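The first two metrics above can be sketched in a few lines of NumPy. This is a minimal implementation under the standard definitions (equal-width confidence bins for ECE); the function names are my own, not from the original.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| per bin, weighted by bin size.

    confidences: predicted probability of the predicted class, shape (n,)
    correct:     1.0 where the prediction was right, else 0.0, shape (n,)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))
```

A perfectly calibrated model (e.g. 80% confidence with 80% accuracy in every bin) scores an ECE of zero; the overconfident student described earlier shows up as bins where confidence sits well above accuracy.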
Calibration Recovery
Post-hoc calibration fixes the problem without retraining:

- Temperature scaling: learn a single scalar T on validation data that minimizes calibration error, then divide the logits by T before the softmax at inference.
- Platt scaling: fit a logistic regression on validation-set predictions.

Both add negligible latency (one multiply or a small linear layer). Always evaluate calibration on a held-out set separate from the validation set used to fit the scaling.
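Temperature scaling is simple enough to sketch end to end. This version fits T by grid search over negative log-likelihood on validation logits; a grid is an assumption for clarity (gradient-based fits are common too), and the function names are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize against overflow
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar T minimizing NLL on held-out validation logits."""
    n = len(val_labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.log(probs[np.arange(n), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

At inference the fitted T divides the logits before the softmax, as described above; the argmax (and hence accuracy) is unchanged, only the confidence values move. As the text notes, report final calibration numbers on a third split, not on the data used to choose T.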