Drift Detection and Staleness Budgets
Types of Drift
Drift detection is the monitoring layer that decides when to retrain. Three types of drift matter in production. Data drift occurs when input feature distributions shift (users start browsing on mobile instead of desktop, or new product categories launch). Concept drift occurs when the relationship between features and labels changes (click-through rates drop during an economic downturn even though the content is the same). Label shift occurs when the distribution of outcomes changes (fraud becomes more frequent as attacks concentrate on high-value transactions).
Staleness Budgets
Staleness budgets formalize how much delay is acceptable at each stage. Define SLOs such as "features updated within 5 minutes," "model trained on the last 14 days of data," and "model refreshed every 24 hours." Uber enforces these budgets across thousands of models: real-time ride-matching features must update within minutes, while pricing models can tolerate daily refreshes.
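One way to make such budgets checkable is to reduce each one to "this timestamp must not be older than this limit." The sketch below is a minimal illustration under that assumption; the names StalenessBudget, violations, and BUDGETS, and the artifact keys "features" and "model_refresh", are hypothetical and not taken from any particular platform. A training-window budget would need its own definition of which timestamp to check, so it is left out here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StalenessBudget:
    """Maximum tolerated age for one artifact (hypothetical structure)."""
    name: str
    max_age: timedelta


def violations(budgets, last_updated, now):
    """Return the names of budgets whose artifact is missing or too old.

    last_updated maps a budget name to the time the artifact was last refreshed.
    """
    stale = []
    for budget in budgets:
        timestamp = last_updated.get(budget.name)
        if timestamp is None or now - timestamp > budget.max_age:
            stale.append(budget.name)
    return stale


# Two of the SLOs from the text, expressed as maximum ages.
BUDGETS = [
    StalenessBudget("features", timedelta(minutes=5)),
    StalenessBudget("model_refresh", timedelta(hours=24)),
]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    observed = {
        "features": now - timedelta(minutes=7),    # over the 5 minute budget
        "model_refresh": now - timedelta(hours=3), # within the 24 hour budget
    }
    print(violations(BUDGETS, observed, now))
```

Treating a missing timestamp as a violation is a deliberate choice in this sketch: an artifact that has never reported a refresh is handled as stale rather than silently passing.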
Population Stability Index
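The key metric is the Population Stability Index (PSI), which measures how far a binned feature or score distribution has shifted from a baseline. PSI between 0.1 and 0.2 indicates minor drift, between 0.2 and 0.25 indicates significant drift that requires investigation, and above 0.25 indicates major drift that triggers an automatic retrain. Set thresholds with hysteresis, so that an alert is raised at a higher PSI than the level at which it clears; a value fluctuating around a threshold then does not repeatedly trigger and reset the alert.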
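As an illustration, PSI over binned counts is commonly computed as the sum over bins of (actual share - expected share) * ln(actual share / expected share). The sketch below implements that formula and pairs it with a simple hysteresis alarm. The function psi, the class HysteresisAlarm, the eps floor for empty bins, and the choice of 0.25 to raise and 0.2 to clear are assumptions made for this example, not a prescribed configuration.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each share at eps so empty bins do not produce log(0).
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        total += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return total


class HysteresisAlarm:
    """Turns on above raise_at and stays on until the value falls below clear_at."""

    def __init__(self, raise_at=0.25, clear_at=0.2):
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        if not self.active and value > self.raise_at:
            self.active = True
        elif self.active and value < self.clear_at:
            self.active = False
        return self.active


if __name__ == "__main__":
    baseline = [50, 50]
    current = [20, 80]
    print(round(psi(baseline, current), 3))

    alarm = HysteresisAlarm()
    for value in [0.22, 0.27, 0.23, 0.26, 0.15]:
        print(value, alarm.update(value))
```

With these settings, a PSI that rises to 0.27 and then hovers at 0.23 keeps the alarm on; it clears only once PSI drops below 0.2.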
Multi-Signal Monitoring
In practice, require drift to be sustained over a sufficient sample size before triggering a retrain, which prevents "retraining storms" caused by temporary traffic anomalies. Netflix monitors multiple signals simultaneously: feature distributions via KS tests, prediction calibration error, and business metrics. Trigger a retrain only when multiple signals align and the effect size exceeds the configured thresholds.
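The gating logic described here can be sketched as a small state machine: a monitoring window counts as drifted only if it has enough samples and enough signals exceed their limits, and a retrain fires only after several consecutive drifted windows. Everything named below (WindowReading, RetrainGate, and all default limits) is hypothetical and chosen only for illustration. The KS statistic is assumed to be computed upstream and is used as the effect size.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowReading:
    """Signals measured over one monitoring window (hypothetical structure)."""
    sample_size: int
    ks_statistic: float       # feature distribution shift, computed upstream
    calibration_error: float  # prediction calibration error
    metric_drop: float        # relative drop in a business metric


class RetrainGate:
    """Fires only when enough signals agree for enough consecutive windows."""

    def __init__(self, min_samples=10_000, min_signals=2, sustain_windows=3,
                 ks_limit=0.1, calibration_limit=0.05, metric_drop_limit=0.02):
        self.min_samples = min_samples
        self.min_signals = min_signals
        self.ks_limit = ks_limit
        self.calibration_limit = calibration_limit
        self.metric_drop_limit = metric_drop_limit
        # Keeps the drifted / not drifted verdict of the most recent windows.
        self._recent = deque(maxlen=sustain_windows)

    def _drifted(self, reading):
        # Windows with too few samples never count as drifted.
        if reading.sample_size < self.min_samples:
            return False
        signals = [
            reading.ks_statistic > self.ks_limit,
            reading.calibration_error > self.calibration_limit,
            reading.metric_drop > self.metric_drop_limit,
        ]
        return sum(signals) >= self.min_signals

    def observe(self, reading):
        """Record one window and return True if a retrain should be triggered."""
        self._recent.append(self._drifted(reading))
        return len(self._recent) == self._recent.maxlen and all(self._recent)


if __name__ == "__main__":
    gate = RetrainGate()
    drifted = WindowReading(20_000, ks_statistic=0.2, calibration_error=0.08, metric_drop=0.0)
    for _ in range(3):
        print(gate.observe(drifted))
```

A single anomalous window, or one with too few samples, breaks the run of consecutive drifted windows, so the gate has to see sustained drift again before it fires. A production version would likely also reset the history after a retrain is launched.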