Global Multi-Horizon Forecasting Models

Traditional time series approaches train one model per series. At scale, this becomes intractable: a retailer with 5 million Stock Keeping Unit (SKU)-store combinations cannot realistically maintain 5 million separate models. Global models solve this by training a single neural network across all entities simultaneously, using embeddings to distinguish between them.

The architecture adds entity identifier embeddings to the input. Each item, store, and category gets a learned vector representation that captures its characteristics. These embeddings are concatenated with the time series values and fed into the LSTM or Transformer. The model learns shared temporal patterns (weekly seasonality, holiday effects, trend dynamics) while the embeddings allow it to specialize predictions per entity. This sharing of statistical strength is powerful: sparse series with only a few months of data benefit from patterns learned on millions of other series.
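The input assembly described above can be sketched in plain Python. This is a minimal illustration, not a production model: the table sizes, `EMBED_DIM`, and the helper names `make_table` and `build_inputs` are assumptions made for the example, and the lookup rows are random here where a real network would learn them during training.

```python
import random

random.seed(0)
EMBED_DIM = 4  # illustrative only; kept tiny so the vectors are easy to inspect


def make_table(num_ids, dim):
    """Randomly initialised lookup table (hypothetical helper).

    In a trained model these rows would be learned parameters.
    """
    return [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_ids)]


# One table per entity type named in the text: item, store, category.
item_table = make_table(num_ids=1000, dim=EMBED_DIM)
store_table = make_table(num_ids=50, dim=EMBED_DIM)
category_table = make_table(num_ids=20, dim=EMBED_DIM)


def build_inputs(history, item_id, store_id, category_id):
    """Return one input vector per time step.

    Each vector is [observed value] + item, store, and category embeddings,
    so the same static entity information is repeated at every step.
    """
    static = (
        item_table[item_id]
        + store_table[store_id]
        + category_table[category_id]
    )
    return [[value] + static for value in history]


inputs = build_inputs([12.0, 15.0, 9.0], item_id=42, store_id=7, category_id=3)
print(len(inputs), len(inputs[0]))  # time steps, features per step
```

The resulting sequence is what would be fed to the LSTM or Transformer. Because the embedding part is identical across time steps, some architectures inject it once as static context instead of repeating it; that is a design choice this sketch does not cover.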

Multi-horizon forecasting extends this further by predicting multiple future time steps simultaneously rather than just the next step. Instead of one output neuron, the network has one output per future step: at hourly resolution, 48 outputs cover the next 48 hours and 168 outputs cover the next 7 days. This direct multi-horizon approach avoids the error accumulation that happens when you recursively predict one step, feed it back as input, predict again, and so on. Amazon retail systems commonly use 48-hour horizons for replenishment planning, while Uber demand forecasting targets 2 to 4 hour horizons for driver allocation.
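The structural difference between recursive and direct forecasting can be shown with a deliberately contrived toy. The two "models" below (`toy_one_step`, `toy_direct`) are hypothetical stand-ins, not trained networks: each simply overshoots the last observed value by 2%. Real direct models have their own per-horizon errors, so treat this only as an illustration of why feeding predictions back in can compound a small bias.

```python
def recursive_forecast(one_step_model, history, horizon):
    """Predict one step, append it to the window, and repeat."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y_hat = one_step_model(window)
        preds.append(y_hat)
        # The prediction replaces the oldest observation, so any error
        # in y_hat becomes part of the next input.
        window = window[1:] + [y_hat]
    return preds


def direct_forecast(multi_step_model, history, horizon):
    """Predict all horizon steps in a single pass from observed data only."""
    preds = multi_step_model(history)
    if len(preds) != horizon:
        raise ValueError("model must return one value per horizon step")
    return preds


BIAS = 1.02   # assumed 2% overshoot, chosen purely for illustration
HORIZON = 24


def toy_one_step(window):
    # Hypothetical one-step model: last value plus a small bias.
    return window[-1] * BIAS


def toy_direct(history):
    # Hypothetical multi-output model: every head sees only real history,
    # so the bias is applied once per output, never compounded.
    return [history[-1] * BIAS for _ in range(HORIZON)]


history = [100.0] * 48  # flat true demand
recursive = recursive_forecast(toy_one_step, history, HORIZON)
direct = direct_forecast(toy_direct, history, HORIZON)

print(f"step 1:  recursive={recursive[0]:.1f}  direct={direct[0]:.1f}")
print(f"step 24: recursive={recursive[-1]:.1f}  direct={direct[-1]:.1f}")
```

Both approaches agree at the first step, but the recursive forecast drifts further from the flat series at each later step because every prediction is built on the previous one.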

Production systems typically output probabilistic forecasts using quantile regression. Instead of predicting a single point value, the model outputs the P10, P50 (median), and P90 quantiles. This captures forecast uncertainty and enables risk-based decisions. For example, inventory planning uses P90 to set safety stock when the cost of stockouts exceeds the cost of overstock. The loss function is pinball loss, which penalizes under-prediction and over-prediction asymmetrically based on the quantile: for quantile q, under-prediction is weighted by q and over-prediction by 1 - q, so the P90 output is pushed upward because falling short of the actual costs nine times more than overshooting it.
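A minimal sketch of pinball loss for a single observation follows, using the standard definition. The function names and the example values are chosen for illustration; a real training loop would average this over every series, time step, and horizon.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one observation and one quantile q in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by q,
    over-prediction (y_true < y_pred) by 1 - q.
    """
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


def mean_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss across quantile outputs.

    preds_by_quantile maps each quantile level to its predicted value,
    for example {0.1: p10, 0.5: p50, 0.9: p90}.
    """
    total = sum(pinball_loss(y_true, pred, q) for q, pred in preds_by_quantile.items())
    return total / len(preds_by_quantile)


# For the P90 output, missing low by 10 units costs far more than
# missing high by 10 units.
under = pinball_loss(100.0, 90.0, q=0.9)
over = pinball_loss(100.0, 110.0, q=0.9)
print(f"P90 under-prediction: {under:.2f}, over-prediction: {over:.2f}")

combined = mean_quantile_loss(100.0, {0.1: 80.0, 0.5: 100.0, 0.9: 120.0})
print(f"mean loss over P10/P50/P90: {combined:.3f}")
```

At q = 0.5 the two weights are equal, so the P50 loss reduces to half the absolute error.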

The challenge is maintaining forecast quality across the entire distribution of entities. Head SKUs with abundant data and stable demand might achieve Weighted Absolute Percentage Error (WAPE) under 5%, while long tail items with intermittent demand may only reach 15 to 20% WAPE. Large retailers target aggregate WAPE under 15% at weekly horizons. Monitoring must track metrics by cohort (head vs tail, category, region) and trigger retraining or fallback rules when specific segments degrade.
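Cohort-level monitoring can be sketched as follows. WAPE is computed here as the sum of absolute errors divided by the sum of absolute actuals. The cohort labels, the sample rows, and the threshold values are assumptions for the example (the thresholds borrow the 5% and 20% figures mentioned above), and `degraded_cohorts` is a hypothetical helper rather than part of any monitoring library.

```python
from collections import defaultdict


def wape(actuals, forecasts):
    """Weighted Absolute Percentage Error: sum|a - f| / sum|a|."""
    denom = sum(abs(a) for a in actuals)
    if denom == 0:
        return float("nan")  # undefined when there is no actual demand
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / denom


def wape_by_cohort(rows):
    """rows: iterable of (cohort, actual, forecast) tuples."""
    actuals = defaultdict(list)
    forecasts = defaultdict(list)
    for cohort, actual, forecast in rows:
        actuals[cohort].append(actual)
        forecasts[cohort].append(forecast)
    return {c: wape(actuals[c], forecasts[c]) for c in actuals}


def degraded_cohorts(rows, thresholds):
    """Return cohorts whose WAPE exceeds their threshold (hypothetical rule)."""
    scores = wape_by_cohort(rows)
    return sorted(c for c, score in scores.items() if score > thresholds[c])


# Made-up sample: stable head items and intermittent tail items.
rows = [
    ("head", 100, 98),
    ("head", 200, 204),
    ("tail", 0, 1),
    ("tail", 2, 1),
    ("tail", 1, 1),
]
thresholds = {"head": 0.05, "tail": 0.20}  # assumed alert levels

scores = wape_by_cohort(rows)
flagged = degraded_cohorts(rows, thresholds)
print(scores, flagged)
```

Because WAPE divides by total actual demand rather than by each actual value, it stays defined when individual tail observations are zero, which plain percentage errors do not.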

💡 Key Takeaways

✓ Global models train a single network across millions of entities using learned embeddings (32 to 64 dimensions) for item ID, store, and category, sharing statistical strength and reducing model count from millions to one

✓ Direct multi-horizon outputs predict all H future steps simultaneously (commonly 48 to 168 steps), avoiding the recursive error accumulation that can double Mean Absolute Percentage Error (MAPE) between the 1-step-ahead and 24-step-ahead forecasts

✓ Probabilistic forecasting with quantile regression outputs P10, P50, and P90 via pinball loss, enabling risk-based decisions such as using P90 for 90% service level inventory planning

✓ Production targets: aggregate Weighted Absolute Percentage Error (WAPE) under 15% at weekly horizon, under 5% for head items, and 15 to 20% for long tail SKUs, with empirical coverage of 80 to 90% for P10 to P90 intervals (nominal coverage is 80%)

✓ Cold start advantage: new products or stores leverage embeddings mapped to similar categories, achieving reasonable forecasts from day one instead of requiring months of history for local models
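The interval coverage figure in the takeaways above can be measured as the fraction of actuals that fall between the P10 and P90 forecasts. The sketch below uses made-up numbers and a hypothetical `interval_coverage` helper. A perfectly calibrated P10 to P90 interval would contain about 80% of actuals, so values well below that suggest intervals that are too narrow, and values well above suggest intervals that are too wide.

```python
def interval_coverage(actuals, lower, upper):
    """Fraction of actuals inside [lower, upper], bounds inclusive."""
    if not actuals:
        raise ValueError("need at least one observation")
    hits = sum(1 for a, lo, hi in zip(actuals, lower, upper) if lo <= a <= hi)
    return hits / len(actuals)


# Illustrative data: the third actual is a demand spike outside the interval.
actuals = [10, 12, 30, 8, 11]
p10 = [8, 9, 10, 7, 9]
p90 = [14, 15, 16, 12, 13]

coverage = interval_coverage(actuals, p10, p90)
print(f"P10 to P90 coverage: {coverage:.0%}")
```

In practice this would be computed per cohort and per horizon step, since calibration often differs between head and tail items.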

📌 Interview Tips

1. Amazon retail replenishment: global model across 20 million SKU-store pairs with a 48-hour multi-horizon output and item and store embeddings (dim 32); predicts P10/P50/P90 quantiles; nightly batch of 20M forecasts in under 10 minutes

2. Uber demand allocation: global LSTM with city and zone embeddings forecasts 2-hour demand across 5000 zones in 50 cities; online scoring at 2000 QPS with p99 under 50ms using micro-batching of 8 requests

3. Retail planning system: 168-step multi-horizon model outputs used directly for weekly replenishment; the P90 quantile sets safety stock, achieving a 92% service level vs 85% with point forecasts
