Model Monitoring & Observability • Prediction Drift Monitoring · Medium · ⏱️ ~3 min
Statistical Metrics for Prediction Drift Detection
Choosing the right statistical distance metric is critical for effective drift detection. The metric must be sensitive enough to catch real issues but stable enough to avoid false alarms from natural variance. Different prediction types and use cases call for different metrics.
For continuous predictions like regression outputs or ranking scores, the Kolmogorov–Smirnov (KS) test and Wasserstein distance work well. KS measures the maximum distance between cumulative distribution functions and is particularly sensitive to shifts in the middle of the distribution, less so in the tails. Wasserstein distance (also called Earth Mover's Distance) captures how much probability mass must move to transform one distribution into the other, and is robust to small shifts. For binned or discrete outputs, Jensen–Shannon (JS) divergence and Kullback–Leibler (KL) divergence are standard choices. Unlike KL, JS divergence is symmetric and, with log base 2, bounded between 0 and 1, making threshold setting intuitive; a JS divergence above 0.1 typically indicates meaningful drift. In credit risk and financial services, the Population Stability Index (PSI) is the industry standard: values under 0.1 are considered negligible, 0.1 to 0.25 is moderate and warrants a watchlist, and above 0.25 triggers retraining or fallback rules.
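A minimal sketch of these metrics: SciPy ships `ks_2samp` and `wasserstein_distance` for the continuous case, while the JS and PSI helpers below are hand-rolled over binned proportions. The distributions, bin values, and thresholds here are illustrative, not from any particular production system.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)

# Continuous scores: KS test and Wasserstein distance on raw samples
reference = rng.normal(0.0, 1.0, 50_000)
current = rng.normal(0.3, 1.0, 50_000)           # mean-shifted window
ks_stat, ks_pvalue = ks_2samp(reference, current)
emd = wasserstein_distance(reference, current)   # ~0.3 for this mean shift

# Binned outputs: JS divergence (log base 2, bounded in [0, 1]) and PSI
def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two binned distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

def psi(expected, actual, eps=1e-12):
    """Population Stability Index over pre-binned counts or proportions."""
    e = np.asarray(expected, float) + eps
    a = np.asarray(actual, float) + eps
    e, a = e / e.sum(), a / a.sum()
    return float(np.sum((a - e) * np.log(a / e)))
```

Reversing the mass in the example bins below pushes PSI past the 0.25 retraining threshold, while identical bins score near zero on both metrics.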
Computational efficiency matters at scale. Major streaming platforms process billions of predictions daily across thousands of models. Computing JS divergence on 100-bin histograms takes under 200 milliseconds on a single CPU core. With 1 percent sampling of 10 million predictions per minute, each model processes about 100 thousand samples per minute, so per-slice divergence computation completes fast enough for 5-minute alerting windows. The key technique is pre-aggregating predictions into histograms rather than storing raw values, which reduces storage by 100 to 1000 times while maintaining detection sensitivity.
💡 Key Takeaways
•Jensen–Shannon divergence is symmetric and bounded 0 to 1, making it intuitive for threshold setting. JS above 0.1 indicates meaningful drift in production systems
•Population Stability Index (PSI) is standard in credit risk: under 0.1 is negligible, 0.1 to 0.25 requires watchlists, above 0.25 triggers retraining. Weekly batch runs on 10 million applications complete in under 30 minutes
•Kolmogorov–Smirnov test detects shifts in continuous distributions and is most sensitive to middle-range changes, ideal for regression and ranking-score monitoring
•Pre-aggregating into 100-bin histograms reduces storage by 100 to 1000 times. Computing JS divergence on histograms takes under 200 milliseconds per slice on a single CPU core
•For imbalanced classifiers with positive rates under 1 percent, monitor both the full probability distribution and the predicted positive rate separately, using binomial exact confidence intervals to catch tail shifts
📌 Examples
A streaming platform computes 100-bin equi-depth histograms of ranking scores every 5 minutes across thousands of models. At a 1 percent sampling rate with 10 million predictions per minute, JS divergence computation completes in under 200 milliseconds per model
A credit scoring model uses PSI with 10 score bins: a PSI of 0.28 was detected when the pandemic shifted applicant risk profiles, triggering model retraining and a compliance review
An ads CTR model with a 0.5 percent historical positive rate detects a 20 percent increase to 0.6 percent within 30 minutes using a binomial test on 100 thousand sampled events, with a false alarm rate near 5 percent
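The positive-rate check from the last example can be sketched with SciPy's exact binomial test; the baseline rate, alpha, and the `positive_rate_alert` helper are illustrative assumptions, not a specific platform's API.

```python
from scipy.stats import binomtest

BASELINE_RATE = 0.005   # 0.5 percent historical positive rate (assumed)
ALPHA = 0.05            # target false alarm rate of roughly 5 percent

def positive_rate_alert(positives, sampled, baseline=BASELINE_RATE, alpha=ALPHA):
    # Exact binomial test of the window's positive rate against the baseline,
    # plus an exact (Clopper-Pearson) confidence interval for the true rate
    result = binomtest(positives, sampled, baseline, alternative="two-sided")
    ci = result.proportion_ci(confidence_level=1 - alpha)
    return result.pvalue < alpha, (ci.low, ci.high)

# 600 positives in 100k sampled events: 0.6 percent vs 0.5 percent baseline
alert, (low, high) = positive_rate_alert(600, 100_000)
```

With 100 thousand sampled events, a shift from 0.5 to 0.6 percent sits several standard errors from the baseline, so the exact test flags it while a window at 0.505 percent does not.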