
Trade-Offs in Multi-Horizon Forecasting Systems

Every architectural decision in multi-horizon forecasting involves sharp trade-offs between accuracy, robustness, cost, and interpretability. Understanding these trade-offs guides system design and helps justify decisions in technical reviews.

Accuracy versus robustness is the first tension. Recursive models can achieve very low error at horizon 1 (3% to 5% MAPE), but errors compound as predictions are fed back in as inputs, reaching 30% to 50% by horizon 20. Direct multi-output models sacrifice 1 to 2 percentage points at horizon 1 but maintain stable error growth across all horizons. In retail operations, recursive models produce optimistic staffing for later hours even when early hours already underperformed. Teams typically choose direct models for horizons beyond 5 steps and reserve recursive models for the ultra short term (next 1 to 3 steps) where accuracy matters most.

Global versus local is a trade of interpretability for scale. One model per series (local) gives perfect interpretability and local fit, but cannot scale to millions of series and fails catastrophically for sparse or new series. A single global model handles 5 million SKUs with shared seasonality and promotional patterns, reducing training cost from weeks to hours. However, global models risk negative transfer if you mix unrelated series (apparel and automotive parts). Mitigations include clustering series by behavior, using static embeddings for personalization, and monitoring per-segment performance. In practice, global models win at scale, and teams invest in post-hoc interpretability tools like variable importance and attention weights.

Probabilistic versus point forecasting trades complexity for operational value. Quantile forecasts expose asymmetric risks and enable decision optimization. A retailer can choose the 90th percentile for high margin perishables (minimize stockouts) and the 50th percentile for low margin bulk goods (minimize holding cost).
This flexibility improves profit by 2% to 5% over point forecasts in Amazon's publicly shared case studies. The costs are increased training complexity (optimizing pinball loss for multiple quantiles), larger output size (7 quantiles times 28 horizons per series), and the requirement that downstream systems consume distributions rather than single numbers. Many legacy systems cannot, forcing a transition period with dual outputs.

Horizon length and granularity explode computational cost. Forecasting 90 days at 15 minute granularity produces 8,640 outputs per series, and training cost grows quadratically with horizon length for attention-based models. Teams typically forecast coarse granularity far out (daily for 8 weeks: 56 outputs) and fine granularity near term (15 minutes for 48 hours: 192 outputs). This reduces compute by 10x and improves calibration, because long-term fine-grained forecasts are inherently noisy.

Feature scope trades accuracy for brittleness. Using known future covariates like planned promotions or committed prices improves accuracy by 5% to 15% over models using only history. However, it creates hard dependencies on upstream planning tools. If marketing changes a promotion schedule after forecast generation, you must invalidate and regenerate all affected forecasts or accept degraded accuracy. Some teams run two parallel systems: a full model with known futures for primary planning and a history-only fallback for robustness when upstream data is unreliable.

Interpretability versus raw accuracy creates adoption friction. Temporal Fusion Transformers offer variable selection and attention-based attribution, showing which features drive each horizon. Pure black-box gradient boosted trees or LSTMs might achieve 1% to 2% better accuracy but face resistance in finance, operations, and regulated industries where explainability is non-negotiable. Teams often accept the accuracy hit to gain stakeholder trust and pass compliance reviews.
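The pinball loss mentioned above is what makes quantile training work: it penalizes under- and over-forecasting asymmetrically, so minimizing it pulls the prediction toward the chosen quantile. A minimal sketch (the toy arrays here are illustrative, not from any real system):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: minimized in expectation by the
    q-th quantile of the target distribution."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y_true = np.array([100.0, 120.0, 90.0])
under = np.full(3, 80.0)   # under-forecast by a wide margin
over = np.full(3, 140.0)   # over-forecast by a similar margin

# At q = 0.9, under-forecasting is penalized far more heavily than
# over-forecasting, which is what pushes high-quantile forecasts upward.
loss_under = pinball_loss(y_true, under, 0.9)
loss_over = pinball_loss(y_true, over, 0.9)
```

Training one model per quantile (or one multi-output head) against this loss is what produces the p50/p90 forecasts a retailer can then pick between.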
Compute and latency budgets force architecture choices. Heavy attention models with 50 million parameters cannot meet a 100 ms online SLA without batching, model distillation, or GPU acceleration. Lightweight gradient boosted trees or small recurrent neural networks (RNNs) hit 20 to 50 ms latency on CPU but sacrifice 3% to 5% long-horizon accuracy. Uber's real-time marketplace uses lightweight models for sub-100 ms latency, while Amazon's overnight batch system uses heavy Temporal Fusion Transformers with 2 to 6 hour training and 60 minute inference because latency is not a constraint.
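The recursive-versus-direct distinction above can be sketched with a toy linear autoregression. This is a minimal illustration, not a production recipe; the AR(2) series, lag window, and helper names are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy AR(2) series to forecast; in practice this would be demand history.
n, lags, horizon = 500, 2, 5
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(scale=0.1)

def make_xy(series, lags, steps_ahead):
    """Lag-window features X and a target `steps_ahead` into the future."""
    X, targets = [], []
    for t in range(lags, len(series) - steps_ahead + 1):
        X.append(series[t - lags:t])
        targets.append(series[t + steps_ahead - 1])
    return np.array(X), np.array(targets)

def fit_linear(X, targets):
    Xb = np.column_stack([X, np.ones(len(X))])  # add intercept column
    w, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
    return w

# Recursive: one 1-step model; its own predictions re-enter the input,
# so any early error propagates to every later horizon.
w1 = fit_linear(*make_xy(y, lags, 1))
window = list(y[-lags:])
recursive = []
for _ in range(horizon):
    pred = float(np.dot(w1[:-1], window[-lags:]) + w1[-1])
    recursive.append(pred)
    window.append(pred)

# Direct: one dedicated model per horizon, always fed real history,
# so errors do not compound across horizons.
direct = []
for h in range(1, horizon + 1):
    wh = fit_linear(*make_xy(y, lags, h))
    direct.append(float(np.dot(wh[:-1], y[-lags:]) + wh[-1]))
```

The structural difference is visible in the loops: the recursive path appends its own predictions to the input window, while the direct path trains `horizon` separate models against true targets.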
💡 Key Takeaways
Accuracy vs robustness: Recursive models achieve 3% to 5% MAPE at horizon 1 but error compounds to 30% to 50% by horizon 20. Direct multi-output sacrifices 1 to 2 points at h1 for stable error growth across all horizons
Global vs local: Local models (one per series) are interpretable but fail at scale and for sparse series. Global models handle 5 million SKUs, reduce training from weeks to hours, but risk negative transfer without clustering or embeddings
Probabilistic vs point: Quantile forecasts enable asymmetric risk optimization (90th percentile for high margin perishables, 50th for bulk goods), improving profit 2% to 5% per Amazon case studies. Cost is training complexity and larger outputs (7 quantiles times 28 horizons)
Horizon length and granularity: 90 days at 15 minute intervals produces 8,640 outputs per series, quadratic compute growth for attention models. Teams forecast daily for 8 weeks (56 outputs) far out, 15 minutes for 48 hours (192 outputs) near term, saving 10x compute
Feature scope: Known futures (planned promotions, committed price) improve accuracy 5% to 15% but create dependencies on upstream systems. If plans change, must invalidate and regenerate forecasts. Some teams run parallel history-only fallback
Compute and latency: Heavy attention models (50M params) cannot meet 100 ms SLA without GPU or distillation. Lightweight models hit 20 to 50 ms on CPU but sacrifice 3% to 5% accuracy. Uber uses lightweight for real-time, Amazon uses heavy for overnight batch
📌 Examples
Recursive vs direct: Retail staffing model with recursive shows 4% h1 error, 12% h3, 28% h7. Direct model shows 6% h1, 9% h3, 14% h7. Direct chosen for robustness despite worse h1
Global model: 5 million SKUs trained with one Temporal Fusion Transformer (64 GPUs, 4 hours) vs per-series models (estimated 3 weeks on 500 node cluster). Global wins on cost and handles cold start
Quantile optimization: High margin perishables use p90 forecast (10% overstock cost vs 50% stockout cost). Low margin bulk uses p50 (20% overstock cost vs 20% stockout cost). Profit improves 3.5% over single point forecast
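The quantile choices in the last example are roughly consistent with the classic newsvendor critical fractile, a standard inventory result (not stated in the original) that maps asymmetric costs to a service quantile:

```python
# Newsvendor critical fractile: the cost-minimizing service quantile is
# q* = stockout_cost / (stockout_cost + overstock_cost).
def optimal_quantile(stockout_cost, overstock_cost):
    return stockout_cost / (stockout_cost + overstock_cost)

# High margin perishables: 50% stockout cost vs 10% overstock cost.
perishables_q = optimal_quantile(0.50, 0.10)  # ~0.83, close to p90
# Low margin bulk: symmetric 20% costs on both sides.
bulk_q = optimal_quantile(0.20, 0.20)         # exactly 0.50, i.e. p50
```

Under these assumed costs the rule lands near p90 for perishables and exactly at p50 for bulk goods, matching the quantiles chosen in the example.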