
Drift Detection and Staleness Budgets

Types of Drift

Drift detection is the monitoring system that decides when to retrain. Three types of drift matter in production. Data drift occurs when input feature distributions shift (users start browsing on mobile instead of desktop, new product categories launch). Concept drift happens when the relationship between features and labels changes (click-through rates drop during economic downturns even with the same content). Label shift means the distribution of outcomes changes (fraud attacks concentrate on high-value transactions).
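The three drift types can be distinguished with simple window-over-window statistics. The sketch below is purely illustrative (the data, the `(is_mobile, clicked)` encoding, and the function names are all invented here): data drift shows up in the feature distribution, label shift in the overall outcome rate, and concept drift in the feature-conditional outcome rate.

```python
def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

def drift_report(reference, current):
    """Toy drift check over (feature, label) pairs, e.g. (is_mobile, clicked).

    Compares a reference window against the current window along the three
    axes described above. All names and data here are illustrative.
    """
    # Data drift: input feature distribution shifts.
    data_drift = abs(mean([f for f, _ in current]) - mean([f for f, _ in reference]))
    # Label shift: distribution of outcomes changes.
    label_shift = abs(mean([y for _, y in current]) - mean([y for _, y in reference]))
    # Concept drift: P(label | feature) changes between windows.
    def rate_given(pairs, feature_value):
        return mean([y for f, y in pairs if f == feature_value])
    concept_drift = abs(rate_given(current, 1) - rate_given(reference, 1))
    return {"data": data_drift, "label": label_shift, "concept": concept_drift}

# Mobile share jumps from 50% to 100%, but per-segment CTR is unchanged:
ref = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 40 + [(1, 1)] * 10
cur = [(1, 0)] * 80 + [(1, 1)] * 20
# drift_report(ref, cur) -> pure data drift: data 0.5, label 0.0, concept 0.0
```

The point of splitting the signals is diagnostic: pure data drift with stable conditional rates may only need feature-store fixes, while concept drift almost always requires retraining.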

Staleness Budgets

Staleness budgets formalize acceptable delays. Define SLOs like "features updated within 5 minutes," "model trained on the last 14 days of data," and "refreshed every 24 hours." Uber enforces these budgets across thousands of models: real-time ride-matching features must update within minutes, while pricing models can tolerate daily refreshes.
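One way to make such SLOs executable is a small budget checker that compares each artifact's last-update time against its budget. A minimal sketch, where the budget names and values mirror the examples above but the structure itself is an illustrative assumption:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class StalenessBudget:
    name: str
    max_age: timedelta

# Illustrative budgets matching the SLO examples in the text.
BUDGETS = [
    StalenessBudget("realtime_features", timedelta(minutes=5)),
    StalenessBudget("training_data_window", timedelta(days=14)),
    StalenessBudget("model_refresh", timedelta(hours=24)),
]

def violations(last_updated, now=None):
    """Return names of budgets whose artifact is older than its SLO.

    last_updated maps budget name -> timezone-aware datetime of last update.
    """
    now = now or datetime.now(timezone.utc)
    return [b.name for b in BUDGETS if now - last_updated[b.name] > b.max_age]

# Example: features 10 minutes old, data 3 days old, model 30 hours old.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = violations({
    "realtime_features": now - timedelta(minutes=10),
    "training_data_window": now - timedelta(days=3),
    "model_refresh": now - timedelta(hours=30),
}, now=now)
# stale -> ["realtime_features", "model_refresh"]
```

A check like this typically runs on a schedule and pages or triggers a retrain when the violation list is non-empty.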

Population Stability Index

The key metric is the Population Stability Index (PSI), which measures distribution shift. PSI values above 0.1 indicate minor drift, above 0.2 significant drift requiring investigation, and above 0.25 major drift triggering an automatic retrain. Set thresholds with hysteresis to avoid false positives from transient spikes.
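A minimal PSI implementation bins a reference and a current sample and sums the per-bin divergence, paired with a hysteresis gate that only fires after the threshold is exceeded on several consecutive checks. The bin count, epsilon, and `sustain` window below are illustrative choices, not standardized values:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample ('expected')
    and a current sample ('actual') of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) or 1.0
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp so out-of-range values land in the edge bins.
            idx = max(0, min(int((x - lo) / width * bins), bins - 1))
            counts[idx] += 1
        # Small epsilon keeps log() finite on empty bins.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(psi_history, threshold=0.25, sustain=3):
    """Hysteresis: fire only if PSI exceeded the threshold on the last
    `sustain` consecutive checks, filtering transient spikes."""
    recent = psi_history[-sustain:]
    return len(recent) == sustain and all(p > threshold for p in recent)
```

For example, `psi(sample, sample)` is 0 for any sample, while a large location shift of the current window pushes PSI well past 0.25; `should_retrain` then requires that elevated reading to persist before acting.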

Multi-Signal Monitoring

In practice, require drift to be sustained over a sufficient sample size before triggering a retrain, preventing "retraining storms" caused by temporary traffic anomalies. Netflix monitors multiple signals simultaneously: feature distributions via KS tests, prediction calibration error, and business metrics. Trigger a retrain only when multiple signals align and the effect size exceeds its threshold.
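A gate in that spirit can be sketched as a function over the individual signals. The specific thresholds below (a KS-test p-value cutoff, a 0.01 calibration-error bound, a 5% business-metric move, and a 100,000-sample volume floor) are hypothetical values chosen for illustration, not anyone's production configuration:

```python
def retrain_gate(ks_pvalue, calibration_error, business_metric_delta,
                 n_samples, min_samples=100_000):
    """Trigger a retrain only when multiple drift signals agree AND the
    sample size is large enough to rule out transient anomalies."""
    signals = [
        ks_pvalue < 0.01,                   # feature distribution shifted (KS test)
        calibration_error > 0.01,           # predictions drifting off-calibration
        abs(business_metric_delta) > 0.05,  # e.g. play-start rate moved > 5%
    ]
    # Require volume AND at least two independent signals aligning.
    return n_samples >= min_samples and sum(signals) >= 2
```

Requiring two of three signals means a lone noisy metric cannot fire the pipeline, while the volume floor prevents a brief traffic anomaly from doing so.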

💡 Key Takeaways
Population Stability Index (PSI) quantifies distribution shift: PSI between 0.1 and 0.2 indicates minor drift, above 0.2 signals significant drift requiring action, and above 0.25 triggers an automatic retrain in most production systems
Hysteresis prevents retraining storms: Airbnb requires drift sustained over 100,000 impressions before triggering, filtering out transient traffic spikes from product launches or marketing campaigns
Multi-signal monitoring catches different failure modes: Netflix tracks feature distributions via Kolmogorov-Smirnov tests, prediction calibration error (threshold 0.01), and business metrics like play-start rate simultaneously
Staleness budgets formalize acceptable delays: streaming features updated within 1 to 15 minutes for behavioral signals, batch features within 24 hours for long-horizon aggregates, and models retrained when data age exceeds 7 to 28 days
False negatives are more dangerous than false positives: silent performance degradation loses revenue, so set conservative thresholds and monitor multiple metrics rather than relying on a single signal
📌 Interview Tips
1. Uber fraud detection monitors transaction PSI, velocity features, and conversion rates over 15-minute windows, triggering a retrain when PSI exceeds 0.25 or the fraud rate spikes by 50% sustained over 2 hours
2. Meta ad ranking tracks drift per advertiser segment because global metrics can hide tail regressions, enforcing per-segment AUC-ROC thresholds and triggering segmented retrains