
Data Drift Detection and Automated Retraining

Data drift occurs when input feature distributions shift over time, degrading model performance because the training data no longer reflects production reality. The Population Stability Index (PSI) and the Kolmogorov–Smirnov (KS) statistic are common drift metrics. PSI compares feature distributions between a baseline period (for example, last month) and the current period by bucketing values and summing weighted log ratios; a PSI above 0.2 signals significant drift. For example, if a fraud model trained on pre-holiday shopping patterns sees a PSI of 0.25 on transaction amount and merchant category features during Cyber Monday, it may miss novel fraud patterns because it was never exposed to that distribution.

Concept drift is more insidious: feature distributions may stay stable while the relationship between features and labels changes. A credit risk model trained in 2019 may see stable income and debt features, but the label distribution shifts during an economic downturn and calibration breaks: the model predicts a 5 percent default risk but observes 12 percent actual defaults, creating massive financial exposure. Detecting concept drift requires monitoring model calibration curves, Expected Calibration Error (ECE), and label-adjusted performance metrics with a delay that accounts for label lag (fraud labels, for example, may arrive 7 days after the prediction).

Automated retraining triggers when drift exceeds thresholds or when online metrics degrade. For example: if PSI exceeds 0.2 on 3 key features for 2 consecutive days, or Mean Average Precision (MAP) drops by more than 3 percent for 3 days, launch a training job on the most recent 14-day data window. Retraining on every drift signal, however, causes model thrash and instability. The fix is a promotion gate: the newly trained model must beat the current production model on a validation set by a pre-agreed margin (for example, an AUC improvement greater than 0.5 points) and pass stability checks (calibration slope within 0.02, no slice with a precision drop greater than 5 percent) before deploying.

Seasonality complicates drift detection. A retail recommendation model sees predictable PSI spikes every weekend and every December, so naive thresholds trigger false positives. Solutions include seasonal baselines (compare the current weekend to the last 4 weekends, not to weekdays), exponentially weighted moving averages that adapt to trends, or statistical tests with Bonferroni correction for multiple comparisons. Google's Continuous Evaluation framework and Uber's monitoring systems use time-windowed comparisons against historical quantiles and suppress alerts during known seasonal windows. Meta's learning platforms compute drift per traffic slice (mobile vs. web, geo region, user tenure cohort) to catch localized shifts that global metrics miss.
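A minimal sketch of the PSI calculation described above, using buckets derived from baseline quantiles; the 10-bucket choice, the epsilon smoothing, and the example distributions are illustrative assumptions rather than a standard implementation:

```python
import numpy as np


def population_stability_index(baseline: np.ndarray, current: np.ndarray, n_buckets: int = 10) -> float:
    """PSI = sum over buckets of (current_frac - baseline_frac) * ln(current_frac / baseline_frac)."""
    # Bucket edges from baseline quantiles, so each baseline bucket holds roughly equal mass.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_buckets + 1))
    # Clip current values into the baseline range so out-of-range values land in the edge buckets.
    current = np.clip(current, edges[0], edges[-1])

    baseline_counts, _ = np.histogram(baseline, bins=edges)
    current_counts, _ = np.histogram(current, bins=edges)

    eps = 1e-6  # avoids division by zero / log(0) for empty buckets
    baseline_frac = baseline_counts / baseline_counts.sum() + eps
    current_frac = current_counts / current_counts.sum() + eps
    return float(np.sum((current_frac - baseline_frac) * np.log(current_frac / baseline_frac)))


# Hypothetical example: a shifted Cyber Monday distribution versus a pre-holiday baseline.
rng = np.random.default_rng(0)
pre_holiday = rng.lognormal(mean=3.5, sigma=0.6, size=100_000)   # made-up transaction amounts
cyber_monday = rng.lognormal(mean=4.0, sigma=0.9, size=100_000)
print(population_stability_index(pre_holiday, cyber_monday))     # well above the 0.2 threshold
```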
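For the concept-drift side, a short sketch of Expected Calibration Error for a binary classifier; the equal-width 10-bin scheme is an assumption (other binnings are common):

```python
import numpy as np


def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Bucket predictions by predicted probability, then average |observed positive rate -
    mean predicted probability| weighted by bucket size. Rising ECE while feature PSI stays
    flat is a concept-drift signal: inputs look the same, but the feature-label link moved."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)  # equal-width probability bins
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            observed = y_true[mask].mean()    # actual positive rate in this bucket (y_true is 0/1)
            predicted = y_prob[mask].mean()   # average predicted probability in this bucket
            ece += mask.mean() * abs(observed - predicted)
    return float(ece)
```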
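The trigger-and-gate policy described above might look like the following sketch. The data shapes, the interpretation of "0.5 AUC points" as 0.005 in raw AUC, and "calibration slope within 0.02" as within 0.02 of the ideal slope of 1.0 are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds from the policy above: retrain if PSI > 0.2 on 3 key features for 2 consecutive
# days, or if MAP drops by more than 3 percent for 3 consecutive days.
PSI_THRESHOLD = 0.2
PSI_FEATURES_REQUIRED = 3
PSI_CONSECUTIVE_DAYS = 2
MAP_DROP_THRESHOLD = 0.03
MAP_CONSECUTIVE_DAYS = 3


def should_retrain(daily_psi: List[Dict[str, float]], daily_map: List[float], baseline_map: float) -> bool:
    """daily_psi: per-day {feature: PSI}, newest last. daily_map: per-day MAP, newest last."""
    recent_psi = daily_psi[-PSI_CONSECUTIVE_DAYS:]
    psi_trigger = len(recent_psi) == PSI_CONSECUTIVE_DAYS and all(
        sum(v > PSI_THRESHOLD for v in day.values()) >= PSI_FEATURES_REQUIRED for day in recent_psi
    )
    recent_map = daily_map[-MAP_CONSECUTIVE_DAYS:]
    map_trigger = len(recent_map) == MAP_CONSECUTIVE_DAYS and all(
        (baseline_map - m) / baseline_map > MAP_DROP_THRESHOLD for m in recent_map
    )
    return psi_trigger or map_trigger


@dataclass
class EvalReport:
    auc: float                   # raw AUC on the validation set, e.g. 0.912
    calibration_slope: float     # slope of observed vs. predicted probabilities (ideal: 1.0)
    slice_precision_drop: float  # worst precision drop across monitored slices vs. production


def passes_promotion_gate(candidate: EvalReport, production: EvalReport) -> bool:
    """Promote only if the candidate clearly beats production and is stable."""
    return (
        candidate.auc - production.auc > 0.005             # > 0.5 AUC points (assumed = 0.005 raw)
        and abs(candidate.calibration_slope - 1.0) <= 0.02  # calibration slope within 0.02 of ideal
        and candidate.slice_precision_drop <= 0.05           # no slice loses more than 5% precision
    )
```

The gate is what prevents model thrash: a drift trigger only launches a training job, and the candidate still has to earn its way into production.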
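One way to implement the seasonal-baseline idea is to compare today's PSI against the same weekday over recent weeks instead of a global threshold. The class name, the 4-week window, and the 3-sigma rule below are illustrative assumptions:

```python
from collections import deque

import numpy as np


class SeasonalDriftMonitor:
    """Flags drift only when today's PSI is unusual for this day of week,
    so routine weekend or holiday spikes do not page anyone."""

    def __init__(self, weeks: int = 4, sigmas: float = 3.0):
        self.history = {dow: deque(maxlen=weeks) for dow in range(7)}  # PSI history per day-of-week
        self.sigmas = sigmas

    def is_anomalous(self, day_of_week: int, psi_today: float) -> bool:
        past = self.history[day_of_week]
        alert = False
        if len(past) == past.maxlen:                       # wait until a full seasonal baseline exists
            mean, std = np.mean(past), np.std(past) + 1e-9
            alert = psi_today > mean + self.sigmas * std   # flag only unusual-for-this-weekday drift
        past.append(psi_today)
        return alert
```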
💡 Key Takeaways
Population Stability Index (PSI) compares feature distributions across time periods, PSI greater than 0.2 signals significant drift requiring investigation, computed by bucketing values and summing weighted log ratios across buckets
Concept drift changes the feature-to-label relationship without shifting input distributions: A 2019 credit model sees stable income and debt features, but the default rate jumps from 5 percent to 12 percent in a downturn, breaking calibration
Automated retraining triggers on drift thresholds (PSI greater than 0.2 on 3 features for 2 days or MAP drop greater than 3 percent for 3 days) but requires promotion gates to prevent model thrash from retraining on noise
Promotion gates enforce improvement: Candidate must beat the production model by more than 0.5 AUC points, keep calibration slope within 0.02, and show no slice precision drop greater than 5 percent before deploying, preventing unstable model churn
Seasonality causes false positives: Retail models see predictable PSI spikes every weekend and December, require seasonal baselines (compare to last 4 weekends, not weekdays) or exponentially weighted moving averages to adapt
Label lag delays concept drift detection: Fraud labels arrive 7 days after prediction, credit defaults appear in 30 to 90 days, requires buffering performance metrics and using proxy signals (dispute rate, customer service calls) for faster feedback
📌 Examples
Uber fraud model drift: Monitors PSI on transaction_amount, merchant_category, and hour_of_day features hourly, triggers retraining if PSI greater than 0.25 on 2 features for 24 hours, retrains on last 14 days (8TB data), promotes only if AUC improves by 0.5 points on validation week
Netflix recommendation concept drift: Pandemic shifts viewing patterns, PSI on genre features stays under 0.15 but watch time prediction Mean Absolute Error (MAE) climbs 15 percent, triggers retraining, new model trained on last 30 days improves MAE by 10 percent and is promoted
Google Search ranking seasonal handling: Black Friday queries show PSI of 0.4 on shopping related features, but this is expected annually, system uses historical quantile comparison (current Black Friday vs last 3 years) and suppresses drift alerts during known seasonal windows
Meta ad ranking slice-specific drift: Global PSI is 0.12 (acceptable), but mobile iOS 16 users in Europe show PSI of 0.28 on click probability features after an OS update; slice-level monitoring catches this and triggers targeted retraining for that segment
← Back to CI/CD for ML Overview