
How to Implement Forecast Evaluation at Scale

Production forecast evaluation at tens or hundreds of millions of predictions requires careful architectural choices around compute, storage, stratification, and alerting. The goal is to surface actionable insights to multiple stakeholders within minutes while avoiding common pitfalls.

Start with rolling origin backtesting for offline evaluation. Split your historical data into K folds, typically 8 to 12 monthly folds for retail or 4 weekly folds for logistics. For each origin time t, train on all data up to t, forecast horizons h from 1 to H (often 1 to 13 weeks), and materialize predictions with keys for entity ID, timestamp, forecast creation time, and horizon. At 50 million SKU locations and 13 horizons per fold, you generate 650 million forecast points per fold. Store predictions and actuals in columnar formats like Parquet, partitioned by cohort and horizon. A distributed Spark or Dask job on a few hundred vCPUs can scan 200 to 400 GB and compute stratified metrics in 20 to 40 minutes.

Stratification is essential to prevent Simpson's paradox and to surface meaningful insights. Predefine cohorts by volume quantiles (A/B/C movers), demand pattern (intermittent versus continuous, using percent of zero weeks), and seasonality strength. Compute metrics per cohort and horizon, not just global aggregates. This enables targeted model improvements and prevents a regression on high value segments from being hidden by improvements on the long tail.

For online monitoring, maintain rolling one hour, 24 hour, and seven day windows for bias, Weighted Absolute Percentage Error (WAPE), and RMSE. Use streaming aggregators that update incrementally as actuals arrive. Trigger alerts when bias for A movers exceeds 3 to 5% for two consecutive days, or when WAPE increases more than 30% relative to a 28 day baseline. Include change point detection to reduce alert fatigue during known events like promotions or holidays. When an alert fires, dashboards should surface correlations with feature drift, data pipeline delays, and recent model deployments.

Handle zeros and intermittency explicitly. At the series level, exclude actual equals zero from MAPE, or use Mean Absolute Scaled Error (MASE), which scales by the naive forecast error. At aggregate levels, use WAPE, which naturally handles zeros because the denominator is the sum of actuals across many series. For confidence intervals, use a blocked bootstrap on time series residuals or time aware resampling. At 100 million points, running 1,000 bootstrap samples is expensive; use stratified subsampling at 1% per cohort to compute approximate intervals within 5 minutes.

Governance ties everything together. Model promotion requires passing a metric bundle: for example, WAPE improvement on A movers by at least 2 percentage points, absolute bias under 2%, and RMSE no more than 5% worse than baseline. Publish cohort dashboards so teams cannot game one metric at the expense of another. Log predictions with model version, feature hash, and forecast creation time; log actuals with consistent keys and ingestion timestamps. Build evaluators that join on entity, forecast creation time, horizon, and realization time to prevent horizon leakage and enable per release comparisons.
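To make the backtesting setup concrete, here is a minimal Python sketch of rolling origin fold generation. The function name make_rolling_origin_folds and the monthly origin spacing are illustrative assumptions, not a specific library API; the important part is that every materialized prediction carries entity, forecast creation time, and horizon keys.

```python
import pandas as pd

def make_rolling_origin_folds(history_end, n_folds=12, freq="MS", horizon_weeks=13):
    """Generate (forecast_creation_time, target_dates) pairs for rolling origin backtesting.

    history_end:   last date with known actuals
    n_folds:       number of backtest origins (e.g. 8 to 12 monthly folds for retail)
    freq:          spacing between origins ("MS" = month start)
    horizon_weeks: forecast horizons h = 1..H, in weeks
    """
    # Origins step backward from the end of history, one per fold.
    origins = pd.date_range(end=history_end, periods=n_folds, freq=freq)
    folds = []
    for t in origins:
        horizons = pd.timedelta_range(start="7D", periods=horizon_weeks, freq="7D")
        folds.append({
            "forecast_creation_time": t,
            # Realization times for h = 1..H; predictions for each are stored
            # with keys (entity_id, target_date, forecast_creation_time, horizon)
            # so evaluators can join actuals later without horizon leakage.
            "target_dates": [t + h for h in horizons],
        })
    return folds

for fold in make_rolling_origin_folds("2024-12-01", n_folds=3):
    print(fold["forecast_creation_time"].date(), len(fold["target_dates"]), "horizons")
```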
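The stratified aggregation itself is a single grouped reduction. Below is a pandas sketch of the per cohort, per horizon WAPE, bias, and RMSE computation; at production scale the same groupby would run as a Spark or Dask job over the Parquet partitions. The column names (cohort, horizon, actual, forecast) are assumptions.

```python
import numpy as np
import pandas as pd

def stratified_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Compute WAPE, bias, and RMSE per (cohort, horizon).

    Expects columns: cohort, horizon, actual, forecast.
    """
    df = df.assign(
        err=df["forecast"] - df["actual"],
        abs_err=(df["forecast"] - df["actual"]).abs(),
        sq_err=(df["forecast"] - df["actual"]) ** 2,
    )
    g = df.groupby(["cohort", "horizon"])
    out = pd.DataFrame({
        # WAPE = sum |F - A| / sum A; robust to zero actuals in individual series
        "wape": g["abs_err"].sum() / g["actual"].sum(),
        # Bias = sum (F - A) / sum A; the sign shows over- or under-forecasting
        "bias": g["err"].sum() / g["actual"].sum(),
        "rmse": np.sqrt(g["sq_err"].mean()),
    })
    return out.reset_index()

# Toy usage with a few rows; real runs scan hundreds of millions of points.
demo = pd.DataFrame({
    "cohort":   ["A", "A", "C", "C"],
    "horizon":  [1, 1, 1, 1],
    "actual":   [100.0, 80.0, 0.0, 3.0],
    "forecast": [ 90.0, 85.0, 1.0, 2.0],
})
print(stratified_metrics(demo))
```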
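The online monitoring logic can be sketched as an incremental aggregator that keeps rolling daily sums and applies the alert rules above. The RollingForecastMonitor class, its thresholds, and the assumed 28 day baseline value are hypothetical placeholders for whatever streaming framework is actually in use.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RollingForecastMonitor:
    """Incremental seven day bias/WAPE monitor for one cohort (e.g. A movers).

    Thresholds mirror the text: bias beyond +/-5% for two consecutive days,
    or WAPE more than 30% above a 28 day baseline (baseline value assumed).
    """
    bias_threshold: float = 0.05
    wape_degradation: float = 0.30
    baseline_wape: float = 0.12

    def __post_init__(self):
        self.days = deque(maxlen=7)   # each entry: (sum_err, sum_abs_err, sum_actual)
        self.consecutive_bias_breaches = 0

    def add_day(self, errors, actuals):
        """Fold in one day of (forecast - actual) errors as actuals arrive."""
        self.days.append((sum(errors), sum(abs(e) for e in errors), sum(actuals)))

        tot_err = sum(d[0] for d in self.days)
        tot_abs = sum(d[1] for d in self.days)
        tot_act = sum(d[2] for d in self.days) or 1e-9
        bias, wape = tot_err / tot_act, tot_abs / tot_act

        self.consecutive_bias_breaches = (
            self.consecutive_bias_breaches + 1 if abs(bias) > self.bias_threshold else 0
        )
        alerts = []
        if self.consecutive_bias_breaches >= 2:
            alerts.append(f"bias {bias:+.1%} breached for 2+ consecutive days")
        if wape > self.baseline_wape * (1 + self.wape_degradation):
            alerts.append(f"WAPE {wape:.1%} is >30% above the 28 day baseline")
        return alerts

monitor = RollingForecastMonitor()
print(monitor.add_day(errors=[8.0, -2.0, 6.0], actuals=[100.0, 90.0, 110.0]))
```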
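For the confidence intervals, a blocked bootstrap over a 1% stratified subsample might look like the sketch below, which resamples whole weeks rather than individual points. The column names and week level blocking are assumptions about the evaluation table.

```python
import numpy as np
import pandas as pd

def blocked_bootstrap_wape_ci(df, n_boot=1000, frac=0.01, alpha=0.05, seed=0):
    """Approximate a (1 - alpha) confidence interval for WAPE via blocked bootstrap.

    Assumed columns: cohort, week, actual, forecast. Whole weeks are resampled
    so within-week autocorrelation is preserved; the 1% per-cohort subsample
    keeps 1,000 replicates tractable at 100M-point scale.
    """
    rng = np.random.default_rng(seed)

    # 1% stratified subsample per cohort.
    sample = df.groupby("cohort", group_keys=False).apply(
        lambda g: g.sample(frac=frac, random_state=seed)
    )

    # Pre-aggregate per block (week): sum of |error| and sum of actuals.
    sample = sample.assign(abs_err=(sample["forecast"] - sample["actual"]).abs())
    blocks = sample.groupby("week").agg(abs_err=("abs_err", "sum"),
                                        actual=("actual", "sum"))

    # Resample whole weeks with replacement and recompute WAPE each time.
    n = len(blocks)
    stats = []
    for _ in range(n_boot):
        draw = blocks.iloc[rng.integers(0, n, size=n)]
        stats.append(draw["abs_err"].sum() / draw["actual"].sum())

    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Pre-aggregating to weekly blocks before resampling is what keeps 1,000 replicates cheap: each bootstrap draw touches a few hundred block rows instead of the raw prediction-level table.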
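Finally, the promotion gate reduces to a small bundle check. The thresholds below mirror the figures in the text, and the metric dictionary keys are assumptions about how A-mover metrics are passed around.

```python
def passes_promotion_gate(candidate, baseline,
                          min_wape_gain_pp=0.02,
                          max_abs_bias=0.02,
                          max_rmse_regression=0.05):
    """Check the metric bundle before promoting a model.

    candidate / baseline are dicts of A-mover metrics, e.g.
    {"wape": 0.18, "bias": -0.01, "rmse": 42.0}.
    """
    wape_gain = baseline["wape"] - candidate["wape"]                    # in percentage points
    rmse_regression = (candidate["rmse"] - baseline["rmse"]) / baseline["rmse"]

    checks = {
        "wape_improved_2pp":   wape_gain >= min_wape_gain_pp,
        "abs_bias_under_2pct": abs(candidate["bias"]) <= max_abs_bias,
        "rmse_within_5pct":    rmse_regression <= max_rmse_regression,
    }
    return all(checks.values()), checks

# A candidate that improves WAPE but carries -4% A-mover bias fails the bundle.
ok, detail = passes_promotion_gate(
    candidate={"wape": 0.15, "bias": -0.04, "rmse": 40.0},
    baseline={"wape": 0.18, "bias": -0.01, "rmse": 41.0},
)
print(ok, detail)
```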
💡 Key Takeaways
Rolling origin backtesting: split the last 104 weeks into 8 to 12 folds, train up to origin t and forecast the next H horizons, generating 650 million points per fold at retail scale
Distributed compute: a Spark or Dask job on a few hundred vCPUs scans 200 to 400 GB of Parquet partitioned by cohort and horizon and aggregates metrics in 20 to 40 minutes
Stratify by volume quantiles (A/B/C movers), demand pattern (intermittent vs continuous), and seasonality to surface segment specific insights and prevent Simpson's paradox
Online monitoring: rolling one hour, 24 hour, seven day windows for bias, WAPE, RMSE with alerts at bias over 3 to 5% for two days or WAPE up 30% versus 28 day baseline
Handle zeros with WAPE at aggregate (sum of errors over sum of actuals) and MASE at series level (scale by naive forecast error), exclude actual equals zero from MAPE
Model promotion gate requires a metric bundle: WAPE improvement AND low bias AND RMSE not regressed; log predictions with model version and feature hash for reproducibility
📌 Examples
Retail at 50M SKU locations: 650M predictions per backtest fold stored in Parquet partitioned by product category and week, enables per category deep dives in under 1 minute
Streaming aggregator: incrementally updates seven day WAPE as actuals arrive with 2 hour lag, triggers alert when A mover bias crosses negative 5% threshold for 48 hours
Bootstrap confidence intervals: 1% stratified subsample per cohort runs 1,000 bootstrap iterations to estimate 95% confidence bands for WAPE and RMSE in 5 minutes on 64 cores
Promotion gate failure: candidate model improved overall WAPE 3 points but A mover bias was negative 4%, failed promotion despite aggregate improvement due to stockout risk