Production Pipeline: From Data Ingestion to Online Serving at Scale
A production hierarchical forecasting system operates as a batch pipeline that runs daily or hourly. The pipeline has four stages: data ingestion and compaction, base forecasting, reconciliation, and online serving. Each stage has distinct scale and latency requirements.
Data ingestion handles streaming sales or event data at 50,000 to 200,000 records per second. Raw events land in a write-optimized store such as Apache Kafka or Amazon Kinesis. A daily snapshot task compacts the previous 90 days of data into pre-aggregated panels at all hierarchy levels, reducing read Input/Output (I/O) during training. For 10 million SKU-store combinations with 90 days of daily data, the compacted dataset contains roughly 900 million rows. Storing this in a columnar format such as Apache Parquet with compression yields roughly 50 to 100 Gigabytes (GB), which fits on a single node and can be scanned in under 2 minutes on modern Solid State Drives (SSDs).
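The compaction step reduces to a single aggregation query. A minimal sketch using DuckDB, assuming raw events have already landed as Parquet files with sku_id, store_id, event_date, and units columns (paths and column names here are illustrative, not part of any specific system):

```python
# Daily compaction: aggregate the trailing 90 days of raw events into a
# pre-aggregated leaf-level panel, stored as compressed columnar Parquet.
import duckdb

con = duckdb.connect()
con.execute("""
    COPY (
        SELECT sku_id, store_id, event_date, SUM(units) AS units
        FROM read_parquet('raw_events/*.parquet')   -- hypothetical layout
        WHERE event_date >= current_date - INTERVAL 90 DAY
        GROUP BY sku_id, store_id, event_date
    ) TO 'compacted_panel.parquet' (FORMAT parquet, COMPRESSION zstd)
""")
```

Higher levels of the hierarchy (store, region, total) follow the same GROUP BY pattern, so training reads pre-aggregated panels rather than raw events.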
Base forecasting runs on this compacted data. A global gradient boosted tree or neural model trains once across all series. Training a LightGBM model on 900 million rows with 50 features takes 2 to 4 hours on a 64-core machine with 256 GB of memory. After training, inference generates a 28-day forecast for 10 million leaves, producing 280 million prediction rows. At 10 microseconds per row, a 64-vCPU node scores 6.4 million rows per second, so raw scoring of the 280 million rows takes roughly 45 seconds; end-to-end inference, including feature assembly and coordination overhead, stays under 10 minutes when fanned out across 10 nodes.
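A hedged sketch of this stage, assuming engineered lag and calendar features already exist in the compacted panel; the Tweedie objective and hyperparameters below are illustrative choices, not prescribed by the pipeline:

```python
# Train one global LightGBM model across all series, then batch-score
# the 28-day horizon for every leaf.
import lightgbm as lgb
import pandas as pd

panel = pd.read_parquet("compacted_panel_features.parquet")  # ~900M rows at full scale
feature_cols = [c for c in panel.columns if c not in ("series_id", "date", "units")]

model = lgb.train(
    {"objective": "tweedie", "learning_rate": 0.05, "num_leaves": 255},
    lgb.Dataset(panel[feature_cols], label=panel["units"]),
    num_boost_round=500,
)

future = pd.read_parquet("future_features.parquet")    # 10M leaves x 28 horizons
future["y_hat"] = model.predict(future[feature_cols])  # ~10 microseconds per row per core
```

In production the predict call is sharded by series ID across nodes; each shard is embarrassingly parallel, which is what makes the 10-node fan-out effective.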
Reconciliation takes the base forecasts and enforces aggregation constraints. With a diagonal covariance and subtree parallelism, reconciling 10 million nodes factors into 5,000 independent store-level problems, each with roughly 2,000 nodes. On a 100-core cluster, each core handles 50 stores. Each store reconciliation solves in under 1 second, so the solves complete in under 2 minutes. Including matrix assembly and result serialization, end-to-end reconciliation stays within 5 to 10 minutes. Teams that run full covariance estimation weekly budget a few hours for that step, but daily operational forecasts use a cached or diagonal covariance and finish in minutes.
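The per-store solve is the standard minimum-trace (MinT) projection; a minimal sketch for one subtree, taking the aggregation matrix S, base forecasts, and per-node error variances as inputs (the toy hierarchy is illustrative):

```python
# MinT reconciliation with diagonal covariance W = diag(w):
#   y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat
import numpy as np

def reconcile_subtree(S: np.ndarray, y_hat: np.ndarray, w: np.ndarray) -> np.ndarray:
    Winv_S = S / w[:, None]                      # W^-1 S without forming W
    G = np.linalg.solve(S.T @ Winv_S, Winv_S.T)  # (S' W^-1 S)^-1 S' W^-1
    return S @ (G @ y_hat)                       # coherent by construction

# Toy 3-node hierarchy: total = A + B
S = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_hat = np.array([110.0, 60.0, 45.0])  # incoherent base forecasts (60 + 45 != 110)
w = np.array([4.0, 1.0, 1.0])          # per-node forecast error variances
print(reconcile_subtree(S, y_hat, w))  # row 0 now equals row 1 + row 2
```

For a 2,000-node store subtree the linear solve is only leaf-count by leaf-count, which is why each store completes in under 1 second.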
Online serving writes reconciled forecasts to a low-latency cache such as Redis or DynamoDB. Downstream services such as pricing, inventory allocation, or capacity planning read forecasts with p99 latencies under 50 milliseconds. Forecasts are keyed by series ID and horizon, with a typical read fetching a 28-day vector in a single lookup. For real-time adjustments, incremental update systems run lightweight state-space filters at the leaf level. A Kalman update for one series takes under 1 millisecond, so a single core can update tens of thousands of series per second. Incremental deltas propagate to ancestors using a fast proportional rule until the next batch recomputation.
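A sketch of the leaf-level incremental path, using a local-level Kalman filter; the propagation helper simplifies the proportional rule to an additive pass-through, which also keeps parent sums coherent (function and variable names here are assumptions):

```python
# One local-level Kalman step per observation: a handful of float ops,
# comfortably under 1 millisecond per series.
def kalman_update(level: float, var: float, obs: float,
                  obs_var: float = 1.0, process_var: float = 0.1):
    pred_var = var + process_var               # time update
    gain = pred_var / (pred_var + obs_var)     # Kalman gain
    new_level = level + gain * (obs - level)   # measurement update
    return new_level, (1.0 - gain) * pred_var

def propagate_delta(forecasts: dict, leaf: str, ancestors: list, delta: float):
    # Push the leaf's forecast change up the tree until the next batch
    # reconciliation; adding the same delta at every ancestor preserves sums.
    forecasts[leaf] += delta
    for node in ancestors:
        forecasts[node] += delta
```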
Monitoring tracks level-weighted error across hierarchy levels, the coherence gap between unreconciled sums and reconciled totals, and reconciliation runtime. Alerts fire on hierarchy changes, sudden shifts in allocation proportions, and covariance instability. Fallback policies switch a branch to top-down allocation if the standard deviation at its leaves exceeds a threshold, preventing noise from destabilizing the entire hierarchy. Netflix and Meta capacity teams report running rolling forecasts across thousands of resource pools and regions with daily batch Service Level Agreements (SLAs) under one hour and online what-if computation completing in a few seconds.
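The coherence gap can be computed directly from the reconciled output; a minimal sketch assuming a hypothetical forecasts table (node_id, y_tilde) and a parent-child edge list:

```python
# Coherence-gap monitor: |parent forecast - sum of its children's forecasts|.
# A nonzero gap after reconciliation signals a stale or broken hierarchy map.
import pandas as pd

def coherence_gap(forecasts: pd.DataFrame, edges: pd.DataFrame) -> pd.Series:
    child_sums = (
        edges.merge(forecasts, left_on="child_id", right_on="node_id")
             .groupby("parent_id")["y_tilde"].sum()
    )
    parents = forecasts.set_index("node_id")["y_tilde"]
    return (parents.reindex(child_sums.index) - child_sums).abs()
```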
💡 Key Takeaways
• Data ingestion streams 50,000 to 200,000 events per second; a daily snapshot compacts 90 days into roughly 900 million rows and 50 to 100 GB of Parquet, scannable in under 2 minutes on SSDs for training
• Base forecasting trains a global LightGBM model on 900 million rows in 2 to 4 hours (64 cores, 256 GB memory); end-to-end inference generates 280 million predictions in under 10 minutes on a 10-node cluster at roughly 10 microseconds per row
• Reconciliation with diagonal covariance and subtree parallelism solves 5,000 independent store problems (roughly 2,000 nodes each) on 100 cores, each store in under 1 second, for an end-to-end reconciliation of 5 to 10 minutes
• Online serving writes to Redis or DynamoDB with p99 read latency under 50 milliseconds; incremental Kalman updates take under 1 millisecond per series, enabling real-time adjustments at tens of thousands of series per second per core
• Monitoring tracks level-weighted error, coherence gap, and reconciliation runtime, with fallback policies switching a branch to top-down allocation when leaf standard deviation exceeds a threshold, preventing noise from destabilizing the hierarchy
📌 Examples
Amazon scale: 900-million-row daily snapshot, 2-hour LightGBM training, 10-minute inference across 10 nodes, 5-minute reconciliation; the total daily batch completes in under 3 hours, well within the operational SLA
Netflix capacity planning: Daily rolling forecasts across thousands of resource pools, batch SLA under one hour, online what-if scenario computation completing in a few seconds for real-time capacity decisions
Uber demand: Hourly batch pipeline, 30-minute data compaction, 15-minute inference, 5-minute reconciliation per city in parallel; forecasts available for real-time dispatch within one hour of the event stream