
Data Drift Detection and Automated Retraining

Data drift occurs when input feature distributions shift over time, degrading model performance because the training data no longer reflects production reality. The Population Stability Index (PSI) and the Kolmogorov–Smirnov (KS) statistic are common drift metrics. PSI compares feature distributions between a baseline period (for example, last month) and the current period by bucketing values and summing weighted log ratios; a PSI above 0.2 signals significant drift. For example, if a fraud model trained on pre-holiday shopping patterns sees a PSI of 0.25 on transaction amount and merchant category features during Cyber Monday, it may miss novel fraud patterns because it was never exposed to that distribution.

Concept drift is more insidious: feature distributions may stay stable while the relationship between features and labels changes. A credit risk model trained in 2019 may see stable income and debt features, but the label distribution shifts during an economic downturn and calibration breaks: the model predicts a 5 percent default risk but observes 12 percent actual defaults, creating massive financial exposure. Detecting concept drift requires monitoring model calibration curves, Expected Calibration Error (ECE), and label-adjusted performance metrics with a delay that accounts for label lag (fraud labels, for example, may arrive 7 days after the prediction).

Automated retraining triggers when drift exceeds thresholds or when online metrics degrade. For example: if PSI exceeds 0.2 on 3 key features for 2 consecutive days, or Mean Average Precision (MAP) drops by more than 3 percent for 3 days, launch a training job on the most recent 14-day data window. Retraining on every drift signal, however, causes model thrash and instability. The fix is a promotion gate: the newly trained model must beat the current production model on a validation set by a pre-agreed margin (for example, an AUC improvement greater than 0.5 points) and pass stability checks (calibration slope within 0.02, no slice with a precision drop greater than 5 percent) before deploying.

Seasonality complicates drift detection. A retail recommendation model sees predictable PSI spikes every weekend and every December, so naive thresholds trigger false positives. Solutions include seasonal baselines (compare the current weekend to the last 4 weekends, not to weekdays), exponentially weighted moving averages that adapt to trends, or statistical tests with Bonferroni correction for multiple comparisons. Google's Continuous Evaluation framework and Uber's monitoring systems use time-windowed comparisons against historical quantiles and suppress alerts during known seasonal windows. Meta's learning platforms compute drift per traffic slice (mobile vs. web, geo region, user tenure cohort) to catch localized shifts that global metrics miss.
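A minimal sketch of the PSI calculation described above, using buckets derived from baseline quantiles; the 10-bucket choice, the epsilon smoothing, and the example distributions are illustrative assumptions rather than a standard implementation:

```python
import numpy as np


def population_stability_index(baseline: np.ndarray, current: np.ndarray, n_buckets: int = 10) -> float:
    """PSI = sum over buckets of (current_frac - baseline_frac) * ln(current_frac / baseline_frac)."""
    # Bucket edges from baseline quantiles, so each baseline bucket holds roughly equal mass.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_buckets + 1))
    # Clip current values into the baseline range so out-of-range values land in the edge buckets.
    current = np.clip(current, edges[0], edges[-1])

    baseline_counts, _ = np.histogram(baseline, bins=edges)
    current_counts, _ = np.histogram(current, bins=edges)

    eps = 1e-6  # avoids division by zero / log(0) for empty buckets
    baseline_frac = baseline_counts / baseline_counts.sum() + eps
    current_frac = current_counts / current_counts.sum() + eps
    return float(np.sum((current_frac - baseline_frac) * np.log(current_frac / baseline_frac)))


# Hypothetical example: a shifted Cyber Monday distribution versus a pre-holiday baseline.
rng = np.random.default_rng(0)
pre_holiday = rng.lognormal(mean=3.5, sigma=0.6, size=100_000)   # made-up transaction amounts
cyber_monday = rng.lognormal(mean=4.0, sigma=0.9, size=100_000)
print(population_stability_index(pre_holiday, cyber_monday))     # well above the 0.2 threshold
```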
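For the concept-drift side, a short sketch of Expected Calibration Error for a binary classifier; the equal-width 10-bin scheme is an assumption (other binnings are common):

```python
import numpy as np


def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Bucket predictions by predicted probability, then average |observed positive rate -
    mean predicted probability| weighted by bucket size. Rising ECE while feature PSI stays
    flat is a concept-drift signal: inputs look the same, but the feature-label link moved."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)  # equal-width probability bins
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            observed = y_true[mask].mean()    # actual positive rate in this bucket (y_true is 0/1)
            predicted = y_prob[mask].mean()   # average predicted probability in this bucket
            ece += mask.mean() * abs(observed - predicted)
    return float(ece)
```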
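The trigger-and-gate policy described above might look like the following sketch. The data shapes, the interpretation of "0.5 AUC points" as 0.005 in raw AUC, and "calibration slope within 0.02" as within 0.02 of the ideal slope of 1.0 are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds from the policy above: retrain if PSI > 0.2 on 3 key features for 2 consecutive
# days, or if MAP drops by more than 3 percent for 3 consecutive days.
PSI_THRESHOLD = 0.2
PSI_FEATURES_REQUIRED = 3
PSI_CONSECUTIVE_DAYS = 2
MAP_DROP_THRESHOLD = 0.03
MAP_CONSECUTIVE_DAYS = 3


def should_retrain(daily_psi: List[Dict[str, float]], daily_map: List[float], baseline_map: float) -> bool:
    """daily_psi: per-day {feature: PSI}, newest last. daily_map: per-day MAP, newest last."""
    recent_psi = daily_psi[-PSI_CONSECUTIVE_DAYS:]
    psi_trigger = len(recent_psi) == PSI_CONSECUTIVE_DAYS and all(
        sum(v > PSI_THRESHOLD for v in day.values()) >= PSI_FEATURES_REQUIRED for day in recent_psi
    )
    recent_map = daily_map[-MAP_CONSECUTIVE_DAYS:]
    map_trigger = len(recent_map) == MAP_CONSECUTIVE_DAYS and all(
        (baseline_map - m) / baseline_map > MAP_DROP_THRESHOLD for m in recent_map
    )
    return psi_trigger or map_trigger


@dataclass
class EvalReport:
    auc: float                   # raw AUC on the validation set, e.g. 0.912
    calibration_slope: float     # slope of observed vs. predicted probabilities (ideal: 1.0)
    slice_precision_drop: float  # worst precision drop across monitored slices vs. production


def passes_promotion_gate(candidate: EvalReport, production: EvalReport) -> bool:
    """Promote only if the candidate clearly beats production and is stable."""
    return (
        candidate.auc - production.auc > 0.005             # > 0.5 AUC points (assumed = 0.005 raw)
        and abs(candidate.calibration_slope - 1.0) <= 0.02  # calibration slope within 0.02 of ideal
        and candidate.slice_precision_drop <= 0.05           # no slice loses more than 5% precision
    )
```

The gate is what prevents model thrash: a drift trigger only launches a training job, and the candidate still has to earn its way into production.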
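One way to implement the seasonal-baseline idea is to compare today's PSI against the same weekday over recent weeks instead of a global threshold. The class name, the 4-week window, and the 3-sigma rule below are illustrative assumptions:

```python
from collections import deque

import numpy as np


class SeasonalDriftMonitor:
    """Flags drift only when today's PSI is unusual for this day of week,
    so routine weekend or holiday spikes do not page anyone."""

    def __init__(self, weeks: int = 4, sigmas: float = 3.0):
        self.history = {dow: deque(maxlen=weeks) for dow in range(7)}  # PSI history per day-of-week
        self.sigmas = sigmas

    def is_anomalous(self, day_of_week: int, psi_today: float) -> bool:
        past = self.history[day_of_week]
        alert = False
        if len(past) == past.maxlen:                       # wait until a full seasonal baseline exists
            mean, std = np.mean(past), np.std(past) + 1e-9
            alert = psi_today > mean + self.sigmas * std   # flag only unusual-for-this-weekday drift
        past.append(psi_today)
        return alert
```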
💡 Key Takeaways
Population Stability Index (PSI) compares feature distributions across time periods, PSI greater than 0.2 signals significant drift requiring investigation, computed by bucketing values and summing weighted log ratios across buckets
Concept drift changes the feature-to-label relationship without shifting input distributions: A 2019 credit model sees stable income and debt features, but the default rate jumps from 5 percent to 12 percent in a downturn, breaking calibration
Automated retraining triggers on drift thresholds (PSI greater than 0.2 on 3 features for 2 days or MAP drop greater than 3 percent for 3 days) but requires promotion gates to prevent model thrash from retraining on noise
Promotion gates enforce improvement: Candidate must beat the production model by more than 0.5 AUC points, keep calibration slope within 0.02, and show no slice precision drop greater than 5 percent before deploying, preventing unstable model churn
Seasonality causes false positives: Retail models see predictable PSI spikes every weekend and December, require seasonal baselines (compare to last 4 weekends, not weekdays) or exponentially weighted moving averages to adapt
Label lag delays concept drift detection: Fraud labels arrive 7 days after prediction, credit defaults appear in 30 to 90 days, requires buffering performance metrics and using proxy signals (dispute rate, customer service calls) for faster feedback
📌 Examples
Uber fraud model drift: Monitors PSI on transaction_amount, merchant_category, and hour_of_day features hourly, triggers retraining if PSI greater than 0.25 on 2 features for 24 hours, retrains on last 14 days (8TB data), promotes only if AUC improves by 0.5 points on validation week
Netflix recommendation concept drift: Pandemic shifts viewing patterns, PSI on genre features stays under 0.15 but watch time prediction Mean Absolute Error (MAE) climbs 15 percent, triggers retraining, new model trained on last 30 days improves MAE by 10 percent and is promoted
Google Search ranking seasonal handling: Black Friday queries show PSI of 0.4 on shopping related features, but this is expected annually, system uses historical quantile comparison (current Black Friday vs last 3 years) and suppresses drift alerts during known seasonal windows
Meta ad ranking slice-specific drift: Global PSI is 0.12 (acceptable), but mobile iOS 16 users in Europe show PSI of 0.28 on click probability features after an OS update; slice-level monitoring catches this and triggers targeted retraining for that segment
← Back to CI/CD for ML Overview