Production Pipeline: From Data Assembly to Serving at Scale
A complete production multi-horizon forecasting system has five layers: data assembly, feature generation, modeling, serving, and monitoring. Each layer has strict latency budgets and data contracts to meet operational SLAs.
Consider a retail demand system that forecasts 5 million SKUs 28 days ahead, at 7 quantiles, every night. Data assembly pulls sales transactions, returns, promotions, price changes, stock levels, and calendar events. The pipeline enforces time cutoffs to prevent leakage: known future prices and promotions are versioned by approval time, and only values committed before the forecast creation time are readable. With 3 TB of raw input per day, data assembly completes in 20 to 40 minutes against a distributed object store.
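A minimal sketch of that cutoff rule, assuming a pandas frame of known-future values where each row carries the time it was approved: only rows committed at or before the forecast creation time are readable, and the latest approved version per series and effective date wins. Column names such as `approved_at` and `promo_discount` are illustrative, not a fixed schema.

```python
import pandas as pd

# Hypothetical known-future inputs (promotions, price changes); each value
# records the wall-clock time it was approved, so versions can be replayed.
known_futures = pd.DataFrame({
    "series_id":      ["sku_1", "sku_1", "sku_2"],
    "effective_date": pd.to_datetime(["2024-07-10", "2024-07-10", "2024-07-12"]),
    "promo_discount": [0.10, 0.25, 0.15],
    "approved_at":    pd.to_datetime(["2024-06-30", "2024-07-05", "2024-07-02"]),
})

def readable_known_futures(df: pd.DataFrame, forecast_creation_time: pd.Timestamp) -> pd.DataFrame:
    """Keep only values committed at or before the forecast creation time,
    then take the latest approved version per (series_id, effective_date)."""
    visible = df[df["approved_at"] <= forecast_creation_time]
    return (
        visible.sort_values("approved_at")
               .groupby(["series_id", "effective_date"])
               .tail(1)
    )

# At a 2024-07-01 cutoff, the later 25% discount for sku_1 is not yet readable.
print(readable_known_futures(known_futures, pd.Timestamp("2024-07-01")))
```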
Feature generation is the dominant cost. For each series, the system builds lagged targets (7, 14, 28 days back), rolling statistics (mean and standard deviation over 7, 14, 30 day windows), holiday flags, promotional event windows, and lead-aligned known futures. With 100 million series (SKU-location combinations rather than bare SKUs) and 60 to 180 historical steps each, this is 6 to 18 billion feature computations. A 200 node Spark or Ray cluster completes this in 45 to 90 minutes. Feature outputs are partitioned by series ID and written to Parquet for the modeling stage.
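The per-series logic behind those numbers is straightforward; the cost comes from running it across 100 million series. Below is a pandas sketch of the lag and rolling-window features named above (holiday flags, promo windows, and lead-aligned known futures are omitted); at scale the same function would run per partition on the Spark or Ray cluster, and the column names are assumptions.

```python
import pandas as pd

def build_features(panel: pd.DataFrame) -> pd.DataFrame:
    """Add lag and rolling features to a long-format panel with one row per
    (series_id, date) and a numeric 'sales' column."""
    panel = panel.sort_values(["series_id", "date"]).copy()
    sales_by_series = panel.groupby("series_id")["sales"]

    # Lagged targets: 7, 14, 28 days back.
    for lag in (7, 14, 28):
        panel[f"sales_lag_{lag}d"] = sales_by_series.shift(lag)

    # Rolling mean and std over 7, 14, 30 day windows, shifted by one step so
    # each window only sees values strictly before the current row (no leakage).
    for window in (7, 14, 30):
        panel[f"sales_mean_{window}d"] = sales_by_series.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )
        panel[f"sales_std_{window}d"] = sales_by_series.transform(
            lambda s, w=window: s.shift(1).rolling(w).std()
        )

    # Downstream, this frame is partitioned by series_id and written to Parquet.
    return panel
```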
Modeling consumes the features, trains or updates a global multi-output model (often a Temporal Fusion Transformer or a gradient boosted tree ensemble), and generates quantile forecasts. Training runs nightly with incremental updates, or weekly for heavy deep learning models. To bound training cost across 100 million series, teams sample windows proportionally to revenue or recent volatility, ensuring high value SKUs get more training weight. A typical run on 32 to 64 GPUs finishes in 2 to 6 hours. The trained model is versioned and pushed to a model registry.
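Two pieces of that setup are easy to make concrete: the pinball (quantile) loss the quantile forecasts are trained and scored against, and revenue-proportional sampling of series to bound training cost. The NumPy sketch below is illustrative, not any particular library's training loop.

```python
import numpy as np

QUANTILES = (0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95)

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Quantile (pinball) loss for one quantile level q, where y_pred is the
    model's q-th quantile forecast, averaged over all series and horizons."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def sample_training_series(series_ids: np.ndarray, revenue: np.ndarray,
                           n_windows: int, rng: np.random.Generator) -> np.ndarray:
    """Draw training windows proportionally to revenue so high-value SKUs get
    more training weight (swap in recent volatility to weight by that instead)."""
    probs = revenue / revenue.sum()
    return rng.choice(series_ids, size=n_windows, replace=True, p=probs)
```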
Serving in batch mode generates forecasts for all series. The inference engine loads the model, reads features in micro-batches of 1,000 to 10,000 series per pass to saturate vector units or GPUs, and writes quantile forecasts (5th, 10th, 20th, 50th, 80th, 90th, and 95th percentiles) for each horizon. Total inference time is 30 to 90 minutes, producing roughly 19.6 billion values (100 million series × 28 horizons × 7 quantiles), or about 157 GB at 8 bytes per value. Results flow to downstream replenishment services, pricing engines, and workforce planners. Amazon has publicly shared that its managed forecasting service produces quantile forecasts with configurable loss functions, aligning with this architecture.
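A schematic of that micro-batched loop. `feature_reader`, `model.predict`, and `writer.write` are assumed interfaces standing in for the feature reader, the registry model, and the Parquet writer; only the shapes (28 horizons × 7 quantiles) come from the text.

```python
HORIZON = 28                                     # forecast steps per series
QUANTILES = (0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95)

def run_batch_inference(feature_reader, model, writer, micro_batch_size: int = 4096):
    """Stream features in micro-batches and write quantile forecasts.

    feature_reader(batch_size) is assumed to yield (series_ids, feature_matrix)
    chunks; model.predict is assumed to return an array of shape
    (batch, HORIZON, len(QUANTILES)).
    """
    for series_ids, features in feature_reader(batch_size=micro_batch_size):
        preds = model.predict(features)                      # (B, 28, 7)
        assert preds.shape[1:] == (HORIZON, len(QUANTILES))  # guard the output contract
        writer.write(series_ids, preds)                      # e.g. append to Parquet partitions
```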
Now consider a real-time marketplace like Uber forecasting supply and demand every 1 to 5 minutes, 60 minutes ahead at 5 minute granularity. Each city has 10,000 zones; with 100 cities, each update computes 12 million outputs (1 million zones × 12 horizon steps). The end to end latency budget is under 300 ms. Feature lookup from a low latency feature store (precomputed static features, cached short lags, streaming recent events) takes 60 to 120 ms. The model forward pass must stay under 50 to 80 ms on CPU with vectorized batching, or under 20 ms on GPU. If feature ingestion lags or inference exceeds 200 ms, the system falls back to a last-good forecast with a decay factor for a few update cycles. Uber has described using probabilistic time series methods and multi-horizon forecasts in its Orbit framework for marketplace control.
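A sketch of that fallback path: if the forward pass errors out or blows the 200 ms threshold, serve the last good forecast decayed toward a baseline, one decay step per stale cycle. The 0.9 decay factor and the blend-toward-baseline rule are assumptions for illustration; the text only says a decay factor is applied for a few update cycles.

```python
import time
import numpy as np

INFERENCE_BUDGET_S = 0.200   # fall back if the forward pass exceeds 200 ms
DECAY_PER_CYCLE = 0.9        # hypothetical decay factor

def forecast_or_fallback(model, features, last_good: np.ndarray,
                         baseline: np.ndarray, cycles_stale: int):
    """Return (served_forecast, new_last_good, new_cycles_stale)."""
    start = time.monotonic()
    try:
        preds = model.predict(features)
        if time.monotonic() - start <= INFERENCE_BUDGET_S:
            return preds, preds, 0                 # fresh forecast, reset staleness
    except Exception:
        pass                                       # treat inference errors like a timeout

    # Fallback: decay the last good forecast toward a baseline (e.g. a seasonal average).
    weight = DECAY_PER_CYCLE ** (cycles_stale + 1)
    served = weight * last_good + (1.0 - weight) * baseline
    return served, last_good, cycles_stale + 1
```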
Monitoring closes the loop. Rolling backtests compute weighted absolute percentage error (WAPE) by horizon and by segment (for example, long tail SKUs selling under 10 units per week). Calibration checks measure whether the 90th percentile forecast actually covers 90% of realizations over a trailing 30 day window. Drift detection watches input distributions (sudden price changes, event frequency shifts) and output distributions (forecast mean and variance). Alerts fire when horizon 1 MAPE increases by more than 30% for two consecutive days, or when quantile coverage deviates by more than 5 percentage points. Runbooks provide mitigation steps: retrain with a shorter history window, increase the horizon 1 loss weight, switch to scenario ensembles, or activate the baseline fallback model.
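The core checks reduce to a few formulas. Below is a NumPy sketch of horizon-level WAPE, empirical quantile coverage, and the 5-percentage-point coverage alert described above; thresholds mirror the text, function names are illustrative.

```python
import numpy as np

def wape(actuals: np.ndarray, forecasts: np.ndarray) -> float:
    """Weighted absolute percentage error: sum of |errors| over sum of |actuals|.
    Compute it per horizon and per segment (e.g. long-tail SKUs) in backtests."""
    return float(np.abs(actuals - forecasts).sum() / np.abs(actuals).sum())

def quantile_coverage(actuals: np.ndarray, quantile_forecasts: np.ndarray) -> float:
    """Share of realizations at or below the quantile forecast; a calibrated
    90th-percentile forecast should score about 0.90 over the trailing window."""
    return float(np.mean(actuals <= quantile_forecasts))

def coverage_alert(actuals: np.ndarray, q90_forecasts: np.ndarray,
                   target: float = 0.90, tolerance: float = 0.05) -> bool:
    """Fire when coverage deviates from the nominal level by more than
    5 percentage points, matching the alert rule above."""
    return abs(quantile_coverage(actuals, q90_forecasts) - target) > tolerance
```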
💡 Key Takeaways
• Data assembly pulls 3 TB per day with strict time cutoffs. Known futures versioned by commitment time to prevent leakage. Completes in 20 to 40 minutes on distributed storage
• Feature generation is the bottleneck: 6 to 18 billion computations for 100 million series over 60 to 180 steps. 200 node cluster finishes in 45 to 90 minutes, often the dominant pipeline cost
• Training on 32 to 64 GPUs for 2 to 6 hours generates a global model. Teams sample by revenue or volatility to bound cost. Model versioned and pushed to registry
• Batch serving generates 19.6 billion forecasts (157 GB) in 30 to 90 minutes with micro-batches of 1,000 to 10,000 series per GPU pass. Meets overnight SLA for replenishment and pricing systems
• Real-time marketplace (Uber, DoorDash) updates every 1 to 5 minutes with 300 ms end to end latency budget: 60 to 120 ms feature lookup, 50 to 80 ms inference on CPU or 20 ms on GPU. Falls back if latency exceeds 200 ms
• Monitoring tracks horizon-specific WAPE, quantile calibration (does 90th percentile cover 90% of actuals), and drift. Alerts on 30% MAPE increase or 5 point coverage deviation with actionable runbooks
📌 Examples
Retail batch pipeline: 3 TB data → 200 node feature gen (60 min) → 64 GPU training (3 hours) → batch inference (60 min) → 157 GB quantile forecasts to replenishment
Uber real-time: 10,000 zones per city, 100 cities, 12 million outputs every 3 minutes. Feature store serves in 80 ms, model inference 40 ms on CPU, total under 300 ms including orchestration
Google demand forecasting: Multi-region compute clusters, feature store with 500 ms p99 lookup, fallback to seasonal naive if feature freshness exceeds 5 minute threshold