Baseline Selection Strategies and Trade-offs
The baseline you compare against fundamentally determines what kind of drift you detect and how many false alarms you generate. There is no single best choice: each baseline strategy trades off sensitivity, false-positive rate, and operational complexity.
Training baseline uses the prediction distribution from your training or validation set. This is the strictest approach and works well immediately after deployment to catch integration bugs or training-serving skew. The downside is over-alerting as your product and users naturally evolve: a recommendation model trained on pre-pandemic data will constantly alert on post-pandemic traffic patterns even if the model is working correctly.

Rolling baseline uses a moving window of recent production predictions, typically 7 to 30 days. This adapts to gradual shifts and drastically reduces false positives from slow product evolution. However, it can mask slow drift because the baseline itself drifts: if your model degrades slowly over months, a rolling baseline might never trigger because it keeps adapting.

Seasonal baseline compares current predictions to the same hour of day and day of week from previous weeks. This handles diurnal and weekly cycles elegantly. Uber uses 7-day seasonal baselines for ETA predictions, per city and hour, to avoid spurious alerts during predictable commute peaks and weekend patterns.
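To make the comparison concrete, here is a minimal Python sketch of scoring one hour of current predictions against all three baseline types. The choice of PSI as the drift statistic, the 10-bin histograms, and the beta-distributed toy scores are illustrative assumptions, not prescriptions from this section.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and current predictions.

    Assumes prediction scores live in [0, 1]; adjust `range` otherwise.
    """
    e_counts, edges = np.histogram(expected, bins=bins, range=(0.0, 1.0))
    a_counts, _ = np.histogram(actual, bins=edges)
    e_frac = e_counts / e_counts.sum() + eps
    a_frac = a_counts / a_counts.sum() + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Toy data standing in for stored prediction samples (all hypothetical).
rng = np.random.default_rng(0)
training_preds = rng.beta(2.0, 5.0, 50_000)   # frozen training/validation scores
rolling_preds  = rng.beta(2.2, 5.0, 200_000)  # last 14 days of production scores
seasonal_preds = rng.beta(2.1, 5.0, 20_000)   # same hour-of-week slot, 7 days ago
current_preds  = rng.beta(2.4, 5.0, 20_000)   # predictions from the last hour

for name, baseline in [("training", training_preds),
                       ("rolling_14d", rolling_preds),
                       ("seasonal_7d", seasonal_preds)]:
    print(f"PSI vs {name}: {psi(baseline, current_preds):.4f}")
```

With these toy numbers, the training baseline yields the largest PSI and the adaptive baselines smaller ones, mirroring the sensitivity ordering described above.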
Most mature systems run multiple baselines in parallel. Use a training baseline for the first few weeks post-deployment to catch immediate issues, switch to a rolling baseline for ongoing monitoring of stable models, and add seasonal baselines for use cases with strong time patterns like ride sharing, food delivery, or content engagement. For high-stakes models like fraud detection or medical diagnosis, also maintain a frozen baseline from a known-good period and alert when drift from that reference exceeds conservative thresholds, even if rolling metrics look stable.
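Continuing the sketch above, a parallel-baseline check can be as simple as a list of (reference sample, threshold) pairs, with a deliberately tighter threshold on the frozen known-good reference. The thresholds and names below are assumptions for illustration; psi() and the toy arrays come from the previous sketch.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BaselineCheck:
    name: str
    reference: np.ndarray  # stored baseline prediction sample
    threshold: float       # PSI value that triggers an alert

def evaluate_baselines(current: np.ndarray, checks: list[BaselineCheck]) -> list[str]:
    """Return an alert message for every baseline whose drift exceeds its threshold."""
    alerts = []
    for check in checks:
        score = psi(check.reference, current)  # psi() from the previous sketch
        if score > check.threshold:
            alerts.append(f"[{check.name}] PSI={score:.3f} exceeds {check.threshold}")
    return alerts

# Hypothetical configuration: adaptive baselines get looser thresholds,
# the frozen known-good reference gets a conservative one.
checks = [
    BaselineCheck("rolling_14d", rolling_preds, threshold=0.20),
    BaselineCheck("seasonal_7d", seasonal_preds, threshold=0.20),
    BaselineCheck("frozen_known_good", training_preds, threshold=0.10),
]
for alert in evaluate_baselines(current_preds, checks):
    print(alert)
```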
💡 Key Takeaways
•Training baseline is strictest, ideal for the first 2 to 4 weeks post-deployment to catch training-serving skew and integration bugs, but over-alerts as user behavior and product features naturally evolve
•Rolling baseline over 7 to 30 days adapts to gradual shifts and reduces false positives by 10x, but can mask slow drift over months because the baseline itself drifts along with the degrading model
•Seasonal baselines comparing the same hour-of-day and day-of-week from 7 days prior eliminate spurious alerts from predictable cycles. Uber uses this for ETA predictions to handle commute peaks and weekend patterns
•Frozen reference baseline from a known-good period is critical for high-stakes domains like fraud and medical diagnosis. Even if rolling metrics look stable, drift from the frozen reference triggers investigation
•Storage cost increases with multiple baselines: seasonal requires 168 hourly histograms per model (7 days × 24 hours), rolling needs 30 daily snapshots, but total overhead stays under 50 megabytes per model with histogram aggregation; see the back-of-the-envelope sketch after this list
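As a sanity check on the storage figure above, here is a back-of-the-envelope calculation. The bin count, bytes-per-bin encoding, and the number of tracked signals are all assumed values; the point is only that histogram aggregation keeps the footprint in the low megabytes.

```python
# Rough storage estimate for the baseline counts in the takeaway above.
# Assumptions (not from the source): 10 buckets per histogram, each stored as
# an 8-byte bin edge plus an 8-byte count, and histograms kept per tracked signal.

BINS = 10
BYTES_PER_BIN = 2 * 8                  # float64 edge + float64 count
HISTOGRAM_BYTES = BINS * BYTES_PER_BIN

seasonal_histograms = 7 * 24           # one per hour-of-week slot
rolling_histograms = 30                # one daily snapshot
frozen_histograms = 1                  # single known-good reference

signals_tracked = 100                  # hypothetical: predictions plus top input features

total_bytes = (seasonal_histograms + rolling_histograms + frozen_histograms) \
              * HISTOGRAM_BYTES * signals_tracked
print(f"~{total_bytes / 1e6:.1f} MB per model")  # ~3.2 MB, comfortably under the 50 MB budget
```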
📌 Examples
Netflix uses a training baseline for the first month after model deployment, then switches to a 14-day rolling baseline. A seasonal baseline is added for markets with strong weekend-versus-weekday viewing patterns
Airbnb's pricing model maintains three baselines: 7-day rolling for normal operations, seasonal comparing to the same day-of-week from 4 weeks prior to handle monthly booking cycles, and a frozen baseline from the pre-COVID period to detect structural market shifts
Meta ads ranking switched from a training baseline to a 30-day rolling baseline after constant alerts from natural campaign-mix evolution, and added a frozen baseline from a stable quarter to catch predictions degrading back to old patterns