Modeling Strategies: Recursive vs Direct Multi-Output vs Per-Horizon Models
Three architectural patterns dominate multi-horizon forecasting, each with sharp trade-offs in accuracy, error propagation, and computational cost. Recursive models predict one step ahead, then feed that prediction back as input to predict the next step, repeating for all horizons. They are conceptually simple and can achieve high accuracy at horizon 1. However, because each step consumes the previous step's prediction, errors compound: a 5% error at step 1 can grow to roughly 15% by step 5 and 40% by step 20. In operations, this manifests as optimistic staffing for the second hour even when the first hour already underperformed.
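A minimal sketch of the recursive loop, assuming a fitted one-step regressor with a scikit-learn style predict; the function name and window handling are illustrative:

```python
import numpy as np

def recursive_forecast(model, history, horizon):
    """Roll a one-step-ahead model forward `horizon` steps.

    `model.predict` is assumed to take a (1, n_lags) array of the most
    recent values and return the next value, scikit-learn style. Each
    prediction is appended to the window, so any error it carries is
    fed into every later step.
    """
    n_lags = len(history)
    window = list(history)                  # oldest value first
    forecasts = []
    for _ in range(horizon):
        x = np.asarray(window[-n_lags:]).reshape(1, -1)
        y_hat = float(model.predict(x)[0])
        forecasts.append(y_hat)
        window.append(y_hat)                # feed the prediction back in
    return np.asarray(forecasts)

# e.g. recursive_forecast(fitted_regressor, series[-60:], horizon=28)
```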
Direct multi-output models predict all horizons simultaneously in a single forward pass. Architectures like sequence-to-sequence models, N-BEATS, or Temporal Fusion Transformers (TFT) take a historical window (60 to 180 steps) and known future covariates, then output all future values (say, 28 or 60 steps) at once. These models avoid error propagation and can share learned representations across horizons, capturing patterns like weekday/weekend structure or promotional lift curves. The trade-off is that they sometimes underfit near-term horizons unless you explicitly weight early horizons more in the loss function. Training complexity also increases because the model must learn correlations across all horizons jointly.
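A minimal PyTorch sketch of the direct pattern plus the horizon-weighted loss mentioned above; this is a toy MLP stand-in, not TFT or N-BEATS, and the weight schedule is illustrative:

```python
import torch
import torch.nn as nn

WINDOW, HORIZON = 60, 28

class DirectMultiOutput(nn.Module):
    """Map a history window to every horizon in one forward pass."""
    def __init__(self, window=WINDOW, horizon=HORIZON, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon),     # one output per future step
        )

    def forward(self, x):                   # x: (batch, window)
        return self.net(x)                  # (batch, horizon)

# Illustrative per-horizon weights: ~3x on horizon 1, decaying toward 1,
# so the loss pushes the model to stay sharp on the near term.
weights = 1.0 + 2.0 * torch.exp(-torch.arange(HORIZON, dtype=torch.float32) / 7.0)

def weighted_mse(pred, target):             # both (batch, horizon)
    return ((pred - target) ** 2 * weights).mean()
```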
A third pattern trains one model per horizon. Horizon 1 gets a dedicated model optimized for short-term accuracy, horizon 7 gets another, and so on. This is robust to horizon-specific patterns and prevents one bad horizon from degrading others. The cost is linear in the number of horizons: for a 28-day forecast, you train and serve 28 models. At scale, with millions of series, this approach becomes prohibitively expensive. Teams use it selectively, for example training separate models for horizons 1, 7, 14, and 28 and interpolating intermediate horizons.
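A sketch of the selective variant with scikit-learn, assuming lag-feature inputs; models are fitted only at horizons 1, 7, 14, and 28, and the rest are linearly interpolated:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

KEY_HORIZONS = [1, 7, 14, 28]

def fit_key_horizons(X, Y):
    """Fit one model per key horizon.

    X: (n_samples, n_lags) lag features; Y: (n_samples, 28) targets,
    where column h-1 holds the value h steps ahead.
    """
    return {h: GradientBoostingRegressor().fit(X, Y[:, h - 1])
            for h in KEY_HORIZONS}

def predict_all_horizons(models, x):
    """Predict the key horizons, then linearly interpolate the rest."""
    x = np.asarray(x).reshape(1, -1)
    key_preds = [models[h].predict(x)[0] for h in KEY_HORIZONS]
    return np.interp(np.arange(1, 29), KEY_HORIZONS, key_preds)
```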
Production systems increasingly favor direct multi-output models with global architectures that share parameters across thousands to millions of related series. A single Temporal Fusion Transformer trained on 5 million retail SKUs learns shared seasonality, promotional patterns, and holiday effects, then personalizes with static embeddings (store type, region) and SKU-specific embeddings. This approach scales and handles sparse or cold-start series. Amazon's forecasting service uses this pattern: training runs nightly on 32 to 64 GPUs for 2 to 6 hours, and serving generates 100 million forecasts in 30 to 90 minutes, meeting overnight SLA requirements for downstream replenishment systems.
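A stripped-down sketch of the global pattern, far simpler than a production TFT: one shared network for all series, personalized through learned embeddings (the vocabulary sizes and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class GlobalForecaster(nn.Module):
    """One set of shared weights for every series, plus per-entity embeddings."""
    def __init__(self, window=60, horizon=28, n_skus=5_000_000,
                 n_store_types=10, emb_dim=16, hidden=128):
        super().__init__()
        self.sku_emb = nn.Embedding(n_skus, emb_dim)           # SKU identity
        self.store_emb = nn.Embedding(n_store_types, emb_dim)  # static metadata
        self.net = nn.Sequential(                              # shared by all series
            nn.Linear(window + 2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon),
        )

    def forward(self, history, sku_id, store_type):
        # history: (batch, window) floats; sku_id, store_type: (batch,) int indices
        z = torch.cat([history, self.sku_emb(sku_id), self.store_emb(store_type)],
                      dim=-1)
        return self.net(z)                                     # (batch, horizon)
```

Because the network body is shared, structure learned on dense series transfers to sparse ones; a cold-start SKU needs only a fresh embedding row rather than its own model.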
💡 Key Takeaways
• Recursive models predict one step, feed it back, repeat. Simple, but errors compound: 5% at step 1 can become 40% by step 20. In production this shows up as optimistic staffing for later hours
• Direct multi-output models (Temporal Fusion Transformer, N-BEATS, sequence-to-sequence) predict all horizons in one pass. No error propagation and shared structure across horizons, but they can underfit the near term
• Per-horizon models train one model per horizon. Robust and horizon-specific, but cost scales linearly (28 horizons = 28 models). Prohibitive at scale, so it is used selectively for key horizons
• Global models share parameters across millions of series. They learn common seasonality and promotional patterns and personalize with static embeddings, scaling to 5 million SKUs with one model
• Amazon retail uses global direct multi-output models trained nightly on 32 to 64 GPUs (2 to 6 hours), serving 100 million forecasts in 30 to 90 minutes to meet overnight SLAs
• Loss weighting by horizon mitigates near-term underfitting in direct models. Weight early horizons 2x to 3x higher, then decay the weight for later horizons
📌 Examples
Recursive failure: A retail model predicts day 1 with 5% error, but by day 7 error reaches 18% and day 14 hits 35% due to compounding
Direct multi-output: Temporal Fusion Transformer on 5 million SKUs, 60 day history, 28 day horizon, 7 quantiles (see the quantile loss sketch after these examples). Trained on 64 GPUs in 4 hours, serves all forecasts in 60 minutes
Per-horizon for critical use case: Train separate models for horizons 1, 3, 7, 14, 28 only (5 models instead of 28), interpolating intermediate horizons with a weighted average
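Forecasting 7 quantiles, as in the direct multi-output example, implies training with a quantile (pinball) loss; a minimal sketch, with quantile levels chosen for illustration:

```python
import torch

# Illustrative quantile levels; the grid actually used is not given above.
QUANTILES = torch.tensor([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])

def pinball_loss(pred, target):
    """Quantile (pinball) loss averaged over all levels.

    pred: (batch, horizon, 7), one forecast per quantile level;
    target: (batch, horizon) observed values.
    """
    diff = target.unsqueeze(-1) - pred       # positive when we under-forecast
    return torch.max(QUANTILES * diff, (QUANTILES - 1.0) * diff).mean()
```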