Drift Detection and Staleness Budgets
Drift detection is the monitoring system that decides when to retrain. Three types of drift matter in production. Data drift occurs when input feature distributions shift (users start browsing on mobile instead of desktop, new product categories launch). Concept drift happens when the relationship between features and labels changes (click-through rates drop during economic downturns even with the same content). Label shift means the distribution of outcomes changes (fraud attacks concentrate on high-value transactions).
Staleness budgets formalize acceptable delays. Define Service Level Objectives (SLOs) like "features updated within 5 minutes," "model trained on the last 14 days of data," and "refreshed every 24 hours." Uber enforces these budgets across thousands of models: real-time ride-matching features must update within minutes, while pricing models can tolerate daily refreshes. The key metric is the Population Stability Index (PSI), which measures distribution shift. PSI values between 0.1 and 0.2 indicate minor drift, above 0.2 significant drift requiring investigation, and above 0.25 major drift triggering an automatic retrain.
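As a concrete reference, here is a minimal PSI sketch in Python (NumPy only). The quantile-binning scheme, bin count, and clipping epsilon are illustrative assumptions rather than a prescribed implementation; the severity bands mirror the thresholds above.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline (expected) and current (actual) feature sample.

    Bins come from the baseline's quantiles so each bin holds roughly the
    same share of training-time data.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover values outside the baseline range

    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_frac = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip to avoid log(0) when a bin is empty in either sample.
    expected_frac = np.clip(expected_frac, eps, None)
    actual_frac = np.clip(actual_frac, eps, None)

    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

def drift_severity(psi):
    # Bands from the text: 0.1-0.2 minor, >0.2 significant, >0.25 retrain.
    if psi > 0.25:
        return "major: trigger retrain"
    if psi > 0.2:
        return "significant: investigate"
    if psi > 0.1:
        return "minor: monitor"
    return "stable"
```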
In practice, set thresholds with hysteresis to avoid false positives from transient spikes. Airbnb requires drift sustained over 100,000 impressions before triggering a retrain, preventing "retraining storms" from temporary traffic anomalies. Netflix monitors multiple signals simultaneously: feature distributions via Kolmogorov–Smirnov (KS) tests, prediction calibration error (target within 0.01), and business metrics like play start rate. Only trigger when multiple signals align and the effect size exceeds thresholds.
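A minimal sketch of this pattern, assuming SciPy's two-sample KS test: both the KS signal and a calibration check must flag drift, and the condition must persist across a sustained number of impressions before a retrain is requested. The class name, KS p-value cutoff, and reset-on-recovery behavior are illustrative assumptions, not Netflix's or Airbnb's actual implementation.

```python
from dataclasses import dataclass
from scipy.stats import ks_2samp

@dataclass
class DriftMonitor:
    """Multi-signal drift monitor with hysteresis.

    Retraining is requested only when both signals flag drift AND the
    condition has persisted for at least `min_impressions` observations.
    """
    ks_pvalue_threshold: float = 0.01     # assumption: reject distribution match below this p-value
    calibration_threshold: float = 0.01   # from the text: calibration error target within 0.01
    min_impressions: int = 100_000        # from the text: drift sustained over 100k impressions
    drifted_impressions: int = 0          # running count of impressions seen while drifted

    def check(self, baseline_scores, current_scores, calibration_error, batch_impressions):
        ks_drift = ks_2samp(baseline_scores, current_scores).pvalue < self.ks_pvalue_threshold
        calib_drift = calibration_error > self.calibration_threshold

        if ks_drift and calib_drift:
            self.drifted_impressions += batch_impressions
        else:
            self.drifted_impressions = 0  # drift must be sustained, not intermittent

        return self.drifted_impressions >= self.min_impressions  # True => trigger retrain
```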
💡 Key Takeaways
• Population Stability Index (PSI) quantifies distribution shift: PSI between 0.1 and 0.2 indicates minor drift, above 0.2 signals significant drift requiring action, and above 0.25 triggers automatic retraining in most production systems
• Hysteresis prevents retraining storms: Airbnb requires drift sustained over 100,000 impressions before triggering, filtering out transient traffic spikes from product launches or marketing campaigns
• Multi-signal monitoring catches different failure modes: Netflix tracks feature distributions via Kolmogorov–Smirnov tests, prediction calibration error (threshold 0.01), and business metrics like play start rate simultaneously
• Staleness budgets formalize acceptable delays: streaming features updated within 1 to 15 minutes for behavioral signals, batch features within 24 hours for long-horizon aggregates, and models retrained when data age exceeds 7 to 28 days (see the staleness-check sketch after this list)
• False negatives are more dangerous than false positives: silent performance degradation loses revenue, so set conservative thresholds and monitor multiple metrics rather than relying on a single signal
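A minimal sketch of a staleness-budget check, assuming budgets expressed as simple time deltas per asset class; the asset names and exact budget values are illustrative, taken from the ranges in the takeaways above rather than any one company's settings.

```python
from datetime import datetime, timedelta, timezone

# Illustrative staleness budgets (SLOs); values are assumptions drawn from
# the ranges above, not a specific production configuration.
STALENESS_SLOS = {
    "streaming_features": timedelta(minutes=15),   # behavioral signals
    "batch_features": timedelta(hours=24),         # long-horizon aggregates
    "model_training_data": timedelta(days=14),     # retrain when training data is older than this
}

def staleness_violations(last_updated, now=None):
    """Return names of assets whose age exceeds their staleness budget.

    `last_updated` maps asset name -> timezone-aware datetime of last refresh.
    """
    now = now or datetime.now(timezone.utc)
    return [
        name
        for name, budget in STALENESS_SLOS.items()
        if name in last_updated and now - last_updated[name] > budget
    ]

# A model trained on data last refreshed 20 days ago violates its 14-day budget.
now = datetime.now(timezone.utc)
print(staleness_violations({
    "streaming_features": now - timedelta(minutes=3),
    "batch_features": now - timedelta(hours=30),
    "model_training_data": now - timedelta(days=20),
}))  # -> ['batch_features', 'model_training_data']
```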
📌 Examples
Uber fraud detection monitors transaction PSI, velocity features, and conversion rates over 15-minute windows, triggering a retrain when PSI exceeds 0.25 or the fraud rate spikes by 50% sustained over 2 hours
Meta ad ranking tracks per-advertiser-segment drift separately because global metrics can hide tail regressions, enforcing per-segment AUC-ROC thresholds and triggering segmented retrains
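A sketch of per-segment metric monitoring in the spirit of the second example, assuming pandas and scikit-learn; the segment names, AUC floor, and data layout are hypothetical, not Meta's actual pipeline. The point is that each segment is evaluated independently, so a healthy global AUC cannot mask a regressing tail segment.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.70  # assumed per-segment AUC-ROC threshold

def segments_needing_retrain(df):
    """Flag segments whose AUC-ROC falls below the floor.

    `df` has columns: segment, label (0/1), score. Each segment is scored
    on its own so tail regressions are visible.
    """
    flagged = []
    for segment, group in df.groupby("segment"):
        if group["label"].nunique() < 2:
            continue  # AUC is undefined when a segment contains only one class
        auc = roc_auc_score(group["label"], group["score"])
        if auc < AUC_FLOOR:
            flagged.append((segment, auc))
    return flagged

# Toy usage: the enterprise segment's scores are inverted, so only it is flagged.
scores = pd.DataFrame({
    "segment": ["small_biz", "small_biz", "enterprise", "enterprise"],
    "label":   [1, 0, 1, 0],
    "score":   [0.9, 0.2, 0.4, 0.6],
})
print(segments_needing_retrain(scores))  # -> [('enterprise', 0.0)]
```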