Failure Modes and Edge Cases in Multi-Horizon Systems
Production multi-horizon forecasting systems face a constellation of failure modes that can silently degrade accuracy or catastrophically break operational decisions. Understanding these edge cases is critical for building resilient systems.
Data leakage is the most insidious failure. Accidentally using future information, such as a promotion decided on day 10 appearing in a backtest for a forecast created on day 5, produces unrealistically low validation errors, and production performance then collapses once the model is deployed. Teams prevent this with strict cutoff times and versioned data contracts. For known futures, store each value with a valid-from timestamp reflecting the time the business committed to it. Training and backtesting then read only values whose commitment time precedes forecast creation.
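The valid-from rule can be sketched as a point-in-time lookup. This is a minimal illustration, not a reference implementation: the `KnownFuture` record and the `as_of` helper are hypothetical names, and a real system would likely do this as a join in the feature store.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class KnownFuture:
    """Hypothetical record for a known-future value such as a promotion flag."""
    target_date: date   # the day the value applies to
    valid_from: date    # the day the business committed to the value
    value: float


def as_of(records: List[KnownFuture], target_date: date,
          forecast_created_at: date) -> Optional[float]:
    """Return the latest value committed strictly before forecast creation.

    Returns None when nothing had been committed yet, so the caller must
    treat the feature as unknown rather than peek at a later decision.
    """
    visible = [
        r for r in records
        if r.target_date == target_date and r.valid_from < forecast_created_at
    ]
    if not visible:
        return None
    return max(visible, key=lambda r: r.valid_from).value


# A promotion for Jan 12 that was only decided on Jan 10.
promos = [KnownFuture(date(2024, 1, 12), date(2024, 1, 10), 1.0)]
seen_on_day_5 = as_of(promos, date(2024, 1, 12), date(2024, 1, 5))
seen_on_day_11 = as_of(promos, date(2024, 1, 12), date(2024, 1, 11))
```

A backtest that builds features through `as_of` sees the promotion only for forecasts created after the commitment date, which mirrors what production would have known.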
Regime shifts break stationarity. Promotions, stockouts, policy changes, or pandemics can invalidate models trained on the last 2 years of data. Warning signals include horizon 1 Mean Absolute Percentage Error (MAPE) doubling from 8% to 16% within a week and quantile coverage dropping from 90% to 65%. Mitigations include change detection algorithms, more frequent retraining (daily instead of weekly), and explicit regime indicator features such as a binary stockout flag or a promotional intensity score.
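The two warning signals above can be turned into a simple rule-based alert. The sketch below is only one possible check, far simpler than a proper change detection algorithm; the function names and the `mape_ratio` and `coverage_drop` thresholds are illustrative assumptions.

```python
from typing import Sequence


def mape(actuals: Sequence[float], forecasts: Sequence[float]) -> float:
    """Mean absolute percentage error in percent, skipping zero actuals."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def regime_shift_alert(baseline_mape: float, recent_mape: float,
                       baseline_coverage: float, recent_coverage: float,
                       mape_ratio: float = 2.0,
                       coverage_drop: float = 0.10) -> bool:
    """Flag a possible regime shift.

    Fires when recent MAPE reaches mape_ratio times the baseline, or when
    quantile coverage falls by at least coverage_drop (both assumed values).
    """
    mape_breached = recent_mape >= mape_ratio * baseline_mape
    coverage_breached = (baseline_coverage - recent_coverage) >= coverage_drop
    return mape_breached or coverage_breached


# The scenario from the text: MAPE 8% -> 16%, coverage 90% -> 65%.
alert = regime_shift_alert(8.0, 16.0, 0.90, 0.65)
```

An alert like this would typically trigger a retrain or a human review rather than an automatic model swap.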
Cold start and sparse series cripple local models. New SKUs or new delivery zones have no history, so a per-series model cannot forecast them. Global models with static features (store type, product category) and similarity-based embeddings provide reasonable priors, but quantile coverage is often poor for the first 30 to 60 days until enough actuals accumulate. Systems typically fall back to a broader category-level forecast until the series clears a data threshold, such as 30 observations or a coefficient of variation of 0.5.
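The fallback rule might look like the following sketch. It implements only the observation-count threshold; the coefficient-of-variation criterion is left out because the text does not say in which direction it applies. `choose_forecast` and its arguments are hypothetical names.

```python
from typing import Optional, Sequence, Tuple


def choose_forecast(history: Sequence[float],
                    series_forecast: Optional[float],
                    category_forecast: float,
                    min_observations: int = 30) -> Tuple[float, str]:
    """Pick the series-level forecast only once enough history exists.

    Returns the chosen value together with a label for its source so that
    downstream monitoring can track how often the fallback is used.
    """
    if series_forecast is not None and len(history) >= min_observations:
        return series_forecast, "series"
    return category_forecast, "category_fallback"


# A new SKU with 12 days of history still uses the category-level forecast.
new_sku = choose_forecast([3.0] * 12, series_forecast=4.2, category_forecast=5.0)
mature_sku = choose_forecast([3.0] * 45, series_forecast=4.2, category_forecast=5.0)
```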
Intermittent demand creates specialized challenges. Many SKUs have long runs of zeros with occasional spikes (for example, 0, 0, 0, 0, 15, 0, 0, 0, 22). Standard point losses handle this badly: squared error is minimized by the mean, a small fractional value that matches neither the zeros nor the spikes, and absolute error is minimized by the median, which is 0 when most periods are 0. Either way the model learns to predict near zero always and underestimates safety stock. Specialized approaches help, such as pinball loss on upper quantiles or Croston's method for intermittent demand. Otherwise, the model fails for 30% to 50% of long tail SKUs.
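Both remedies are small enough to sketch directly. The code below shows the pinball (quantile) loss and the basic form of Croston's method, which smooths non-zero demand sizes and the intervals between them separately. The smoothing constant `alpha=0.1` and the initialization from the first non-zero demand are assumptions; library implementations may differ in both.

```python
from typing import Sequence


def pinball_loss(actual: float, predicted: float, q: float) -> float:
    """Quantile loss: under-prediction costs q, over-prediction costs 1 - q."""
    diff = actual - predicted
    return q * diff if diff >= 0 else (q - 1.0) * diff


def croston(demand: Sequence[float], alpha: float = 0.1) -> float:
    """Basic Croston forecast of average demand per period.

    Keeps one exponentially smoothed estimate of non-zero demand size and
    one of the interval between demands, updating both only in periods
    where demand occurs. The forecast is size divided by interval.
    """
    size = None
    interval = None
    periods_since_demand = 0
    for d in demand:
        periods_since_demand += 1
        if d > 0:
            if size is None:
                # Assumed initialization: first observed size and interval.
                size, interval = float(d), float(periods_since_demand)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since_demand - interval)
            periods_since_demand = 0
    if size is None:
        return 0.0  # no demand observed at all
    return size / interval


series = [0, 0, 0, 0, 15, 0, 0, 0, 22]
rate = croston(series)                          # average units per period
missed_spike = pinball_loss(15, 0, q=0.9)       # predicted 0, actual 15
false_positive = pinball_loss(0, 2, q=0.9)      # predicted 2, actual 0
```

At the 0.9 quantile, missing the spike costs far more than the small over-prediction, which is the asymmetry a safety stock forecast needs. Croston's output is a per-period rate, so it smooths the spikes rather than predicting when they occur.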
Exogenous forecast errors propagate through the pipeline. Systems often feed weather forecasts or upstream price predictions into the model as if they were known futures. If the weather service has a 20% bias toward warmer temperatures, your demand forecast inherits that bias. Scenario ensembles (warm, median, cold paths) reduce brittleness. Some teams run three parallel forecasts and present the min, median, and max to planners.
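A scenario ensemble can be as simple as running the same model over each exogenous path and summarizing per horizon. In this sketch `toy_demand` is a made-up stand-in for a trained demand model, and the temperature paths are invented inputs used only for illustration.

```python
from statistics import median
from typing import Callable, Dict, List, Sequence, Tuple


def scenario_band(demand_model: Callable[[float], float],
                  temperature_paths: Dict[str, Sequence[float]]
                  ) -> List[Tuple[float, float, float]]:
    """Run one forecast per scenario and return (min, median, max) per horizon."""
    forecasts = [
        [demand_model(t) for t in path] for path in temperature_paths.values()
    ]
    horizon = min(len(f) for f in forecasts)
    band = []
    for h in range(horizon):
        values = [f[h] for f in forecasts]
        band.append((min(values), median(values), max(values)))
    return band


def toy_demand(temperature: float) -> float:
    """Hypothetical model: demand rises linearly with temperature."""
    return 100.0 + 5.0 * temperature


paths = {
    "warm": [22.0, 24.0],
    "median": [20.0, 21.0],
    "cold": [18.0, 17.0],
}
band = scenario_band(toy_demand, paths)
```

Presenting the band instead of a single path makes the dependence on the weather forecast visible to planners.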
Hierarchical inconsistency frustrates inventory planning. Forecasts at SKU level do not sum to category or regional forecasts when they are generated independently. Planners see SKU forecasts totaling 10,000 units next to a category forecast of 9,200 units. Reconciliation methods like bottom-up (sum SKUs), top-down (split category proportionally), or MinT (optimal using covariance of forecast errors) enforce consistency. MinT typically improves aggregate WAPE by 1 to 3 percentage points but adds 10 to 20 minutes to the pipeline. Teams often run MinT weekly for planning and bottom-up daily for operations.
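The two simpler reconciliation methods can be sketched in a few lines. Here the top-down split uses the SKU forecasts themselves as proportions, which is one of several possible choices (historical sales shares are another). MinT is omitted because it needs an estimate of the forecast error covariance. The SKU names and values are made up to match the totals in the text.

```python
from typing import Dict


def bottom_up(sku_forecasts: Dict[str, float]) -> float:
    """Category forecast defined as the sum of SKU forecasts."""
    return sum(sku_forecasts.values())


def top_down(category_forecast: float,
             sku_forecasts: Dict[str, float]) -> Dict[str, float]:
    """Split the category forecast across SKUs in proportion to their forecasts."""
    total = sum(sku_forecasts.values())
    if total == 0:
        raise ValueError("cannot derive proportions from all-zero SKU forecasts")
    return {
        sku: category_forecast * value / total
        for sku, value in sku_forecasts.items()
    }


skus = {"sku_a": 6000.0, "sku_b": 3000.0, "sku_c": 1000.0}  # sums to 10,000
category_bu = bottom_up(skus)            # category moves up to the SKU total
skus_td = top_down(9200.0, skus)         # SKUs shrink to match 9,200
```

Bottom-up trusts the SKU level and top-down trusts the category level, so the choice depends on which level forecasts more reliably for the decision at hand.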
Miscalibrated quantiles lead to operational failures. A 90th percentile forecast that only covers 70% of actuals causes stockouts or surge pricing errors. Track coverage per segment and horizon. For example, long tail SKUs might show 75% coverage while top 100 SKUs hit 92%. Recalibrate with isotonic regression or post-hoc scaling. Monitor coverage weekly and retrain if deviation exceeds 5 percentage points for two consecutive weeks.
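The monitoring rule above can be sketched as an empirical coverage function plus a retrain trigger. The function names are hypothetical, and the trigger looks only at the most recent weeks, which is one reading of "two consecutive weeks".

```python
from typing import Sequence


def coverage(actuals: Sequence[float], quantile_forecasts: Sequence[float]) -> float:
    """Share of actuals that fall at or below the quantile forecast."""
    if not actuals:
        raise ValueError("coverage needs at least one observation")
    hits = sum(a <= q for a, q in zip(actuals, quantile_forecasts))
    return hits / len(actuals)


def needs_retrain(weekly_coverage: Sequence[float], nominal: float = 0.90,
                  tolerance: float = 0.05, consecutive: int = 2) -> bool:
    """True when the latest `consecutive` weeks all deviate beyond tolerance."""
    recent = list(weekly_coverage)[-consecutive:]
    if len(recent) < consecutive:
        return False
    return all(abs(c - nominal) > tolerance for c in recent)


week_coverage = coverage([1, 2, 3, 4], [2, 2, 2, 5])   # 3 of 4 covered
retrain_now = needs_retrain([0.91, 0.80, 0.78])        # two bad weeks in a row
```

Running `coverage` separately per segment and per horizon gives the breakdown described in the text, for example long tail SKUs versus the top 100.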
Latency spikes and stale features break real-time systems. If a 5 minute update window is missed, dispatch or pricing receives forecasts based on data from 10 minutes ago. Use freshness checks on feature timestamps and circuit breakers. If lag exceeds 3 minutes, fall back to a cached forecast with a 0.95 decay factor per missed update. Uber's systems employ this pattern, switching to fallback within 200 ms if primary inference times out.
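A freshness check with a decayed fallback might look like the sketch below. It assumes the 0.95 factor scales the cached forecast value once per missed update, which is only one interpretation of the text; the 200 ms inference timeout is not modeled, and `serve_forecast` is a hypothetical name.

```python
from typing import Optional, Tuple


def serve_forecast(primary: Optional[float], cached: float,
                   feature_lag_seconds: float, missed_updates: int,
                   max_lag_seconds: float = 180.0,
                   decay: float = 0.95) -> Tuple[float, str]:
    """Serve the primary forecast only when its features are fresh.

    primary is None when inference failed or timed out. Otherwise the
    cached forecast is returned, scaled by decay once per missed update
    (assumed interpretation of the decay factor).
    """
    if primary is not None and feature_lag_seconds <= max_lag_seconds:
        return primary, "primary"
    return cached * decay ** missed_updates, "cached_fallback"


fresh = serve_forecast(120.0, 100.0, feature_lag_seconds=60, missed_updates=0)
stale = serve_forecast(120.0, 100.0, feature_lag_seconds=240, missed_updates=2)
```

Returning the source label alongside the value lets monitoring count how often the circuit breaker trips.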
💡 Key Takeaways
• Data leakage from using future information (a promotion decided on day 10 used in a forecast made on day 5) produces 3% backtest error but 22% production error. Prevent it with versioned data and strict time cutoffs
• Regime shifts (pandemic, stockout, policy change) cause horizon 1 MAPE to double (8% to 16%) and quantile coverage to drop (90% to 65%). Mitigate with daily retraining and regime indicator features
• Cold start for new SKUs or zones cripples local models, which have no history to learn from. Global models with static features provide priors but show poor quantile coverage for the first 30 to 60 days. Use a category fallback until 30 observations
• Intermittent demand (long runs of zeros with spikes: [0,0,0,15,0,0,22]) causes models trained with standard losses to predict near zero always, missing spikes. This affects 30% to 50% of long tail SKUs. Use pinball loss or Croston's method
• Hierarchical inconsistency: SKU forecasts sum to 10,000 but the category forecast is 9,200. Reconcile with MinT (1 to 3 percentage point WAPE improvement) or bottom-up. Run MinT weekly for planning, bottom-up daily for operations
• Latency spikes and stale features break real-time systems. If a 5 minute update is missed, fall back to a cached forecast with 0.95 decay per missed cycle. Uber switches to fallback within 200 ms if primary inference times out
📌 Examples
Leakage failure: A promotion flag that was set 3 days before launch is used in a backtest for a forecast made 7 days before launch. Validation MAPE is 4%, but production MAPE jumps to 18% after deployment
Regime shift: COVID lockdown changes demand patterns overnight. A model trained on 2019 data shows 45% MAPE in March 2020. Daily retraining with a 30 day window reduces it to 12% within two weeks
Intermittent demand: A spare parts SKU sells 0 units in 90% of weeks, then spikes to 10 to 20 units. A standard model predicts 0.5 units always, understating safety stock. Croston's method models spike size and spacing separately and reduces stockouts by 40%