
Statistical Methods for Drift Detection and Alerting

Drift detection relies on statistical tests that quantify distribution divergence between training and production data. Three methods dominate production systems, each with specific use cases and thresholds derived from years of industry practice.

Population Stability Index (PSI) is the workhorse for monitoring many features simultaneously. It bins a feature into 10 to 20 buckets, compares production proportions against training baseline proportions, and computes a weighted divergence score. PSI below 0.1 means no significant shift; between 0.1 and 0.25 indicates moderate drift warranting review; above 0.25 signals a major change requiring action. Google Ads uses PSI on hundreds of features with 5-minute windows and 10,000-sample minimums per feature, alerting when any of the top 50 features by importance exceeds 0.25 for two consecutive windows. The method is simple, interpretable, and handles categorical and binned numeric features equally well.

The Kolmogorov-Smirnov (KS) test measures the maximum distance between cumulative distribution functions for continuous features. A p-value below 0.05 indicates significant divergence at 95% confidence. Netflix applies KS tests to score distributions and key numeric features like viewing time and session counts, using 50,000-event windows refreshed every 5 minutes. The test is nonparametric and sensitive to shape changes, but requires sufficient samples. Netflix combines it with effect-size checks, alerting only when the KS distance exceeds 0.1 AND the p-value is below 0.01, filtering noise while catching meaningful shifts.

Jensen-Shannon Divergence (JSD) quantifies the information-theoretic distance between distributions, bounded between 0 and 1. It is symmetric, unlike Kullback-Leibler divergence, and works well for comparing histograms or probability distributions. Uber uses JSD on GPS coordinate distributions and demand density histograms, with thresholds at 0.15 for warnings and 0.3 for critical alerts. For high-dimensional feature spaces, they compute JSD on Principal Component Analysis (PCA) projections to reduce dimensionality while preserving the drift signal.

Thresholds are domain specific, calibrated by replaying historical incidents to find levels that would have detected real issues 48 hours before user impact.
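To make the three metrics concrete, here is a minimal Python sketch using numpy and scipy, not any of the named companies' production code: the psi and js_divergence helpers, the toy data, and the window sizes are all illustrative assumptions. One subtlety worth noting: scipy's jensenshannon returns the JS distance (the square root of the divergence), so the sketch squares it.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training baseline and a
    production window of one numeric feature."""
    # Bin edges come from the training baseline so both samples are
    # compared on the same grid; production values are clipped into range.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # eps guards against log(0) in empty buckets.
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def js_divergence(expected, actual, bins=20):
    """Jensen-Shannon divergence between binned histograms, in [0, 1]
    with base-2 logs. scipy returns the JS *distance*, so square it."""
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    p = np.histogram(expected, bins=edges)[0].astype(float)
    q = np.histogram(actual, bins=edges)[0].astype(float)
    return float(jensenshannon(p / p.sum(), q / q.sum(), base=2) ** 2)

# Toy windows: the production mean has shifted by half a standard deviation.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)
prod = rng.normal(0.5, 1.0, 50_000)

print(f"PSI: {psi(train, prod):.3f}")  # roughly 0.2-0.25: at the major-shift line
ks = ks_2samp(train, prod)             # statistic ~0.2 here, p-value essentially 0
# Effect-size + significance gate, in the style of the Netflix rule above.
print(f"KS distance={ks.statistic:.3f}, p={ks.pvalue:.2e}, "
      f"alert={ks.statistic > 0.1 and ks.pvalue < 0.01}")
print(f"JSD: {js_divergence(train, prod):.3f}")
```

In a real monitor, these functions would run per feature per window, with the PSI, KS, and JSD outputs compared against the thresholds discussed above.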
💡 Key Takeaways
PSI is production-proven for broad monitoring. Twitter monitors 300-plus features per model with PSI computed hourly, alerting ML engineers when 3 or more features exceed 0.25 simultaneously, indicating coordinated drift from upstream changes.
The KS test needs minimum sample sizes for reliability. Instagram requires 10,000 events per feature per window before computing KS statistics, avoiding false positives from small-sample noise while detecting shifts within 30 minutes at their traffic scale.
JSD handles multimodal distributions better than KS. Pinterest uses JSD on user-interest embeddings projected to 50 dimensions, detecting when new content categories emerge and shift the embedding space by more than 0.2 divergence units.
Calibration monitoring catches prediction-quality drift. The Brier score measures the squared error between predicted probabilities and outcomes (see the calibration sketch after this list). Facebook ads tracks Brier score by decile, alerting when top-decile calibration degrades from 0.05 to 0.08 error over 24 hours.
Expected Calibration Error (ECE) quantifies reliability. Divide predictions into 10 buckets, compare the mean predicted probability to the observed positive rate per bucket, and compute the weighted average error. Stripe fraud models maintain ECE below 0.03, alerting at a 0.05 threshold.
Change-point detection with the Page-Hinkley test identifies regime shifts (a detector sketch also follows below). Uber applies it to city-level demand patterns, detecting when a new competitor launch or a transit change causes sustained 15% demand drops in specific zones within 6 hours.
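A minimal sketch of the two calibration metrics above; the function names and toy data are illustrative assumptions, not Facebook's or Stripe's pipelines.

```python
import numpy as np

def brier_score(y_prob, y_true):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(y_prob) - np.asarray(y_true)) ** 2))

def expected_calibration_error(y_prob, y_true, n_bins=10):
    """ECE: bucket predictions, compare mean predicted probability to the
    observed positive rate per bucket, weight by bucket size."""
    y_prob, y_true = np.asarray(y_prob), np.asarray(y_true)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bucket is closed on the right so p = 1.0 is included.
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return float(ece)

# Toy batch: outcomes drawn from the predicted probabilities themselves,
# i.e. a perfectly calibrated model, so ECE should land near zero.
rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 100_000)
y = (rng.uniform(0, 1, 100_000) < p).astype(float)
print(f"Brier: {brier_score(p, y):.3f}")                  # ~1/6 for uniform p
print(f"ECE:   {expected_calibration_error(p, y):.4f}")   # near 0
```

In a monitor, the rolling ECE would simply be compared against the chosen alert threshold (0.05 in the Stripe example above), per model or per score decile.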
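And a sketch of a Page-Hinkley detector oriented to catch sustained drops in a metric stream. The PageHinkley class and the delta/lam settings are illustrative assumptions, not Uber's configuration.

```python
import numpy as np

class PageHinkley:
    """Page-Hinkley test for a sustained drop in a stream's mean.
    delta: tolerated drift magnitude; lam: alarm threshold."""
    def __init__(self, delta=0.005, lam=2.0):
        self.delta, self.lam = delta, lam
        self.n, self.mean = 0, 0.0
        self.cum, self.min_cum = 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n      # running mean
        # Accumulate evidence that values have fallen below the mean.
        self.cum += self.mean - x - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.lam  # True -> change point

# Toy stream: normalized demand with a 15% drop halfway through.
rng = np.random.default_rng(2)
stream = np.concatenate([rng.normal(1.00, 0.05, 500),
                         rng.normal(0.85, 0.05, 500)])
ph = PageHinkley()
for t, x in enumerate(stream):
    if ph.update(x):
        print(f"regime shift flagged at t={t}")  # fires shortly after t=500
        break
```

Because evidence accumulates only while values stay below the running mean, short noise spikes reset quickly while a sustained drop crosses the threshold within a handful of observations.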
📌 Examples
DoorDash monitors restaurant feature drift with PSI on 80 features every 15 minutes using 20,000-order windows. When PSI exceeded 0.3 on prep-time features for 3 consecutive windows, investigation found a partner API change that had shifted time units from minutes to seconds; it was fixed within 90 minutes.
Spotify's recommendation system uses KS tests on audio-feature distributions like tempo and energy, with 100,000-track windows per hour. When the KS distance jumped from 0.05 to 0.18 on energy features, they discovered a data pipeline was accidentally filtering out high-energy tracks, causing a 12% drop in workout playlist engagement.
Zillow's home price model computes JSD on price-per-square-foot distributions by zip code daily, with 500-home minimums. A JSD increase from 0.08 to 0.35 across 50 zip codes over one week correctly detected the start of a localized market correction, triggering model retraining 3 weeks ahead of the monthly scheduled retrain.