Model Monitoring & Observability • Concept Drift & Model Decay
Detection Strategies: Monitoring Drift with Statistical Signals
Production drift detection relies on continuous statistical comparison between recent data windows and a baseline reference. Feature drift detection uses the Population Stability Index (PSI), Jensen-Shannon divergence, or Kolmogorov-Smirnov tests. PSI is the most common: PSI = sum over bins of (actual_pct - expected_pct) * ln(actual_pct / expected_pct). Practical thresholds: PSI below 0.1 means stable, 0.1 to 0.2 flags concern, 0.2 to 0.3 indicates moderate drift requiring investigation, and above 0.3 signals severe drift demanding immediate action.
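A minimal sketch of that binned PSI computation. Quantile bins derived from the baseline and a small epsilon to guard against empty bins are common implementation choices the text does not prescribe:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline sample (expected) and a recent window (actual)."""
    # Quantile bin edges from the baseline, so each bin holds ~equal mass.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    # Clip the recent window into the baseline's range so edge bins absorb outliers.
    actual_clipped = np.clip(actual, edges[0], edges[-1])

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual_clipped, bins=edges)[0] / len(actual)

    # Epsilon guard: avoids ln(0) and division by zero in sparse bins.
    eps = 1e-6
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# A shifted feature distribution should land well inside the "severe" band (> 0.3).
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 100_000)
recent = rng.normal(0.8, 1.2, 50_000)
print(population_stability_index(baseline, recent))
```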
Performance drift monitoring uses prequential evaluation with sliding windows and error-rate control charts. The Drift Detection Method (DDM) tracks the running misclassification rate p_t with standard deviation s_t = sqrt(p_t(1 - p_t) / t), recording the minima p_min and s_min. Enter a warning state when p_t + s_t exceeds p_min + 2·s_min; declare drift when it exceeds p_min + 3·s_min. For gradual drift, Early Drift Detection Method (EDDM) variants track distances between consecutive errors instead of error rates. At Meta and Google ads platforms handling 100k to 1M queries per second (QPS), teams also monitor calibration drift through changes in Brier score and Expected Calibration Error (ECE).
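A compact sketch of the DDM update rule as described. The warm-up sample count and the reset-on-drift behavior are common implementation choices, not specified in the text:

```python
import math

class DDM:
    """Drift Detection Method over a stream of per-prediction error
    indicators (1 = misclassified, 0 = correct)."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # skip the noisy warm-up phase
        self.reset()

    def reset(self):
        self.t = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error: int) -> str:
        self.t += 1
        self.errors += error
        p_t = self.errors / self.t
        s_t = math.sqrt(p_t * (1 - p_t) / self.t)

        if self.t < self.min_samples:
            return "stable"

        # Record the best (lowest) error level seen so far.
        if p_t + s_t < self.p_min + self.s_min:
            self.p_min, self.s_min = p_t, s_t

        if p_t + s_t >= self.p_min + 3 * self.s_min:
            self.reset()  # drift confirmed: restart the baseline
            return "drift"
        if p_t + s_t >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```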
Window sizing balances statistical stability against reaction speed. Event-based windows of 10k to 1M samples or time-based windows of 15 to 60 minutes are typical. At 50k predictions per second, 1% sampling produces roughly 500 events per second; if each event log is 1 KB, storage reaches about 43 GB per day. High-risk models keep full samples; others use lower rates or probabilistic sketches. Multi-signal gating reduces false positives: trigger retraining only when at least two of three conditions hold for sustained periods, for example PSI above 0.3 for 30 minutes, a 5 to 10% AUC drop, and an error increase on critical slices.
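A sketch of that two-of-three gate. The DriftSignals fields and thresholds are hypothetical stand-ins for whatever signals a real monitoring pipeline exports:

```python
from dataclasses import dataclass

@dataclass
class DriftSignals:
    psi: float                     # worst per-feature PSI in the window
    psi_breach_minutes: float      # how long PSI has stayed above threshold
    auc_drop_pct: float            # relative AUC drop vs. baseline, in percent
    critical_slice_error_up: bool  # sustained error increase on a critical slice

def should_retrain(s: DriftSignals) -> bool:
    conditions = [
        s.psi > 0.3 and s.psi_breach_minutes >= 30,
        s.auc_drop_pct >= 5.0,
        s.critical_slice_error_up,
    ]
    # Gate on at least two of three sustained breaches to suppress
    # false positives from normal traffic variation.
    return sum(conditions) >= 2

print(should_retrain(DriftSignals(0.42, 45, 6.5, False)))  # True: two breaches
```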
💡 Key Takeaways
• PSI thresholds guide action: below 0.1 is stable, 0.1 to 0.2 requires monitoring, 0.2 to 0.3 needs investigation, above 0.3 demands immediate retraining. Track per feature and per slice, such as region or device.
• DDM for performance drift: enter warning when p_t + s_t exceeds p_min + 2·s_min; declare drift at p_min + 3·s_min. This catches sudden shifts. EDDM variants that use error distances work better for gradual drift.
• Sampling controls cost at scale: at 50k QPS, 1% sampling generates 500 events/sec, or 43 GB/day at 1 KB per event. Adjust the rate by model risk; use sketches for lower-priority models.
• Multi-signal gating prevents false alarms: require sustained breaches across multiple metrics, for example PSI above 0.3 for 30 minutes AND a 5% AUC drop AND slice degradation, before triggering expensive retraining.
• Window sizing tradeoff: small windows (15 minutes, 10k events) react fast but are noisy; large windows (60 minutes, 1M events) are stable but slow. Uber uses 30-minute windows for ETA drift detection.
• Calibration drift is separate: track Brier score and Expected Calibration Error (ECE); a sketch follows this list. A model can maintain AUC but lose calibration, producing confidently wrong probabilities that break downstream bidding or ranking.
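To make the calibration takeaway concrete, here is a sketch of the Brier score and a standard equal-width-bin ECE; the bin count and binning scheme are conventional choices, not mandated by the text:

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    return float(np.mean((y_prob - y_true) ** 2))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE: per-bin gap between mean predicted probability and
    observed positive rate, weighted by the fraction of samples in the bin."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is counted.
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

Tracking both on the same sliding windows used for PSI makes it easy to see calibration drifting while AUC holds steady.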
📌 Examples
Google and Meta ads CTR models: Monitor PSI per feature across country, device, and placement slices. Alert when PSI exceeds 0.3 for 30 minutes AND calibration error increases by 5% AND top advertiser slice degrades. This multi-signal gate reduces false positives from normal traffic variation.
Uber dispatch models: Compare recent travel-time residuals (predicted minus actual) against a rolling 7-day baseline per region and time of day. Trigger when MAE worsens by more than 10% for 30 minutes. This sustained threshold avoids overreacting to temporary traffic incidents.
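A sketch of that sustained-threshold check, assuming the pipeline supplies per-window residual arrays; the function names are hypothetical and the per-region, per-time-of-day keying is omitted for brevity:

```python
import numpy as np

def mae(residuals):
    """Mean absolute error of predicted-minus-actual residuals."""
    return float(np.mean(np.abs(residuals)))

def sustained_mae_breach(recent_windows, baseline_residuals,
                         worsen_pct=10.0, sustained_windows=2):
    """Alert only when every recent window (e.g. two consecutive
    15-minute windows covering the 30-minute rule) exceeds the
    baseline MAE by worsen_pct, so one incident does not fire."""
    threshold = mae(baseline_residuals) * (1 + worsen_pct / 100.0)
    recent = recent_windows[-sustained_windows:]
    return (len(recent) == sustained_windows
            and all(mae(w) > threshold for w in recent))
```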
Stripe fraud detection: At tens of thousands of QPS, sample 5% of transactions for full feature logging, and log predictions and outcomes for 100%. Run PSI tests on the sampled data every 15 minutes. During attack waves, PSI jumps from 0.05 to 0.4+ within an hour, triggering hourly retraining.