
Evaluation and Monitoring: Backtesting, Residuals, and Drift Detection

Production forecasting systems require rigorous evaluation and continuous monitoring to maintain accuracy and detect degradation. Rolling-origin cross-validation is the gold standard for backtesting: fix a training window, forecast h steps ahead, move the origin forward one period, and repeat. This simulates real deployment, where models forecast into the future without seeing actuals. Test at realistic horizons, such as 1, 7, and 30 days for retail planning, or 5 minutes, 1 hour, and 4 hours for real-time traffic systems.

Metric selection depends on the use case. Mean Absolute Percentage Error (MAPE) is common but undefined for series with zeros. Symmetric MAPE (sMAPE) mitigates this but still behaves inconsistently across scales. Mean Absolute Scaled Error (MASE) is scale-free: it compares forecast error to a naive seasonal baseline, making it robust across series with different volumes. For service-level decisions, track weighted metrics that reflect cost asymmetry: an underforecast stockout may cost 10x more than overforecast holding costs, so penalize errors directionally. Target accuracy such as MAPE under 10 percent at a weekly horizon for high-volume series, relaxing to 20 percent for the long tail.

Residual diagnostics catch model misspecification. Compute the autocorrelation function of the residuals and run Ljung-Box tests; significant autocorrelation at seasonal lags indicates missing seasonal terms or the wrong period. Flag series where the Ljung-Box p-value drops below 0.05, and track the percentage of series with residual variance spikes after promotional events: more than 20 percent flagged suggests systematic undermodeling of events.

Drift and break detection monitors structural changes. Track level shifts using CUSUM or moving averages of residuals; when the cumulative sum exceeds a threshold such as 3 sigma of residual variance, trigger a state reset or retrain. Monitor seasonal amplitude drift: if weekly seasonal indices shift by more than 30 percent over a quarter, seasonality may be evolving and the model needs an update or an extension to time-varying seasonal methods. Data drift also includes changes in input distributions: new product launches, market entry, or post-pandemic behavior shifts.

Interval calibration is critical for downstream planning. Nominally 90 percent prediction intervals should cover actuals 90 percent of the time, yet production systems often see undercoverage, with intervals covering only 75 to 80 percent due to heavy-tailed shocks. Use rolling backtests to estimate empirical coverage and apply correction factors: widen intervals by multiplying the forecast standard deviation by a calibration constant, typically 1.2 to 1.5, until coverage matches the target within 5 percentage points. The sketches that follow make each of these checks concrete.
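A minimal sketch of the rolling-origin loop with a MASE scorer, assuming a univariate NumPy series and a pluggable fit/predict callable. The seasonal-naive stand-in model and the period m=7 are illustrative choices, not part of the source:

```python
import numpy as np

def mase(actuals, forecasts, train, m=7):
    """Mean Absolute Scaled Error: forecast MAE divided by the
    in-sample MAE of a seasonal naive baseline with period m."""
    naive_mae = np.mean(np.abs(train[m:] - train[:-m]))
    return np.mean(np.abs(actuals - forecasts)) / naive_mae

def rolling_origin_backtest(series, fit_predict, window, horizon, m=7):
    """Fix a training window, forecast `horizon` steps ahead,
    advance the origin one period, and repeat."""
    scores = []
    for origin in range(window, len(series) - horizon + 1):
        train = series[origin - window:origin]
        actuals = series[origin:origin + horizon]
        scores.append(mase(actuals, fit_predict(train, horizon), train, m))
    return np.array(scores)

def seasonal_naive(train, h, m=7):
    """Stand-in model: repeat the last seasonal cycle forward h steps."""
    return np.resize(train[-m:], h)

rng = np.random.default_rng(0)
y = 100 + 10 * np.sin(2 * np.pi * np.arange(365) / 7) + rng.normal(0, 3, 365)
print(rolling_origin_backtest(y, seasonal_naive, window=180, horizon=7).mean())
```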
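A cost-weighted error along the lines of the service-level discussion might look like the following; the 10:1 under/over ratio mirrors the stockout example above, and the function name is hypothetical:

```python
import numpy as np

def cost_weighted_error(actuals, forecasts, under_cost=10.0, over_cost=1.0):
    """Mean absolute error with asymmetric unit costs: positive errors
    (underforecasts, i.e. potential stockouts) are weighted under_cost,
    negative errors (overstock) are weighted over_cost."""
    err = actuals - forecasts
    return np.mean(np.where(err > 0, under_cost * err, over_cost * -err))

# One 10-unit underforecast and one 10-unit overforecast: (100 + 10) / 2
print(cost_weighted_error(np.array([100, 100]), np.array([90, 110])))  # 55.0
```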
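The residual screen is a few lines with statsmodels. The 0.05 threshold is the one quoted above; the weekly seasonal lags (7, 14) are illustrative:

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

def flag_residual_autocorrelation(residuals, seasonal_lags=(7, 14), alpha=0.05):
    """Ljung-Box at seasonal lags; a p-value below alpha means the residuals
    still carry structure (missing seasonal terms or the wrong period)."""
    result = acorr_ljungbox(residuals, lags=list(seasonal_lags), return_df=True)
    return bool((result["lb_pvalue"] < alpha).any())

rng = np.random.default_rng(1)
clean = rng.normal(0, 1, 365)                                  # white noise
leaky = clean + 0.8 * np.sin(2 * np.pi * np.arange(365) / 7)   # leftover weekly cycle
print(flag_residual_autocorrelation(clean))   # usually False
print(flag_residual_autocorrelation(leaky))   # True
```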
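For drift detection, here is a sketch of a tabular CUSUM on standardized residuals. The 3-sigma trigger matches the threshold above; the drift allowance k = 0.5 is a conventional default, not something from the text:

```python
import numpy as np

def cusum_level_shift(residuals, sigma, threshold=3.0, k=0.5):
    """Two-sided tabular CUSUM on standardized residuals. Returns the
    first index where accumulated drift exceeds `threshold`, else None."""
    s_pos = s_neg = 0.0
    for i, r in enumerate(residuals / sigma):
        s_pos = max(0.0, s_pos + r - k)    # accumulates upward level shifts
        s_neg = max(0.0, s_neg - r - k)    # accumulates downward level shifts
        if s_pos > threshold or s_neg > threshold:
            return i                       # trigger state reset or retrain here
    return None

rng = np.random.default_rng(2)
resid = rng.normal(0, 1, 200)
resid[120:] += 1.5                          # simulated level shift at t=120
print(cusum_level_shift(resid, sigma=1.0))  # fires shortly after index 120
```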
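Finally, a sketch of empirical coverage estimation and multiplicative widening, assuming Gaussian-shaped intervals parameterized by a forecast mean and standard deviation. The nominal level, the 5-point tolerance, and the typical 1.2 to 1.5 factor come from the text; the heavy-tailed test data is a stand-in for backtest residuals:

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(actuals, means, sds, level=0.90, factor=1.0):
    """Share of actuals inside Gaussian intervals of half-width
    z * factor * sd, where z matches the nominal level."""
    z = norm.ppf(0.5 + level / 2)
    half = z * factor * sds
    return np.mean((actuals >= means - half) & (actuals <= means + half))

def calibrate_factor(actuals, means, sds, level=0.90, tol=0.05, cap=3.0):
    """Grow the widening factor until empirical coverage is within
    tol of the nominal level (typically landing in 1.2 to 1.5)."""
    factor = 1.0
    while (factor < cap and
           empirical_coverage(actuals, means, sds, level, factor) < level - tol):
        factor += 0.05
    return round(factor, 2)

# Heavy-tailed shocks make nominal 90 percent intervals undercover.
rng = np.random.default_rng(3)
means, sds = np.zeros(5000), np.ones(5000)
actuals = rng.standard_t(df=3, size=5000)        # heavier tails than modeled
print(empirical_coverage(actuals, means, sds))   # roughly 0.78 to 0.82
print(calibrate_factor(actuals, means, sds))     # widening factor above 1
```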
💡 Key Takeaways
Rolling-origin cross-validation: fix a training window, forecast h steps ahead, advance one period, and repeat at realistic horizons like 1, 7, and 30 days for retail or 5 minutes and 1 hour for traffic
MASE is scale-free and robust: it compares forecast error to a naive seasonal baseline and works across series with different volumes, unlike MAPE
Residual diagnostics: a Ljung-Box p-value under 0.05 flags autocorrelation indicating missing seasonality; track the percent of series with variance spikes after events
Drift detection: CUSUM on residuals triggers a state reset at a 3-sigma level shift; seasonal amplitude drift over 30 percent in a quarter requires a model update
Interval calibration: production intervals often undercover at 75 to 80 percent instead of the nominal 90 percent; apply a 1.2 to 1.5x multiplier to match the target within 5 points
Service-level metrics: weight errors by cost asymmetry when stockouts cost 10x more than overstock; target MAPE under 10 percent for high volume, 20 percent for the long tail
📌 Examples
Amazon weekly demand forecast: rolling-origin validation at 7- and 30-day horizons, MASE under 0.8 for the top 100K items, interval calibration factor 1.3 for 90 percent coverage
Uber zone demand: 5-minute and 1-hour backtests; Ljung-Box tests flag 8 percent of zones with residual autocorrelation at lag 168 (weekly), triggering a SARIMA order update
Airbnb booking forecast: CUSUM detects a 40 percent level shift from a policy change; automatic state reset restores MAPE from 25 percent to 12 percent within 48 hours