How to Implement Forecast Evaluation at Scale
A production framework for computing and aggregating forecast accuracy metrics across millions of time series with varying characteristics, hierarchies, and business importance levels.
Distributed Metric Computation
Evaluating millions of forecasts requires distributed processing. Partition time series by business unit or product category. Use MapReduce patterns: the map phase computes per-series metrics (MAPE, RMSE, bias); the reduce phase aggregates them into hierarchical summaries. Process incrementally as new actuals arrive rather than recomputing in batch. Cache intermediate results at each hierarchy level for sub-minute dashboard latency.
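The map/reduce split above can be sketched in plain Python. This is a minimal single-process illustration, not a production implementation; the function names and the `"unit/sku"` series-ID convention are assumptions for the example.

```python
from collections import defaultdict

def series_metrics(actuals, forecasts):
    """Map phase: per-series MAPE, RMSE, and bias for one time series."""
    n = len(actuals)
    ape = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts) if a != 0]
    return {
        "mape": 100.0 * sum(ape) / len(ape),
        "rmse": (sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / n) ** 0.5,
        "bias": sum(f - a for a, f in zip(actuals, forecasts)) / n,
    }

def reduce_by_level(per_series, key_fn):
    """Reduce phase: average per-series metrics within each hierarchy key."""
    groups = defaultdict(list)
    for series_id, m in per_series.items():
        groups[key_fn(series_id)].append(m)
    return {
        key: {metric: sum(m[metric] for m in ms) / len(ms)
              for metric in ("mape", "rmse", "bias")}
        for key, ms in groups.items()
    }

# Example: roll two SKU-level results up to the business-unit level.
per_series = {
    "US/sku1": series_metrics([100, 200], [110, 190]),
    "US/sku2": series_metrics([50], [60]),
}
rollup = reduce_by_level(per_series, key_fn=lambda sid: sid.split("/")[0])
```

In a real deployment the map step runs per partition (e.g. a Spark or Beam task per product category) and the reduce step runs once per hierarchy level, with each level's output cached.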
Weighted Aggregation Strategies
Simple averaging misleads when importance varies dramatically. Weight metrics by business impact: revenue contribution, margin, or strategic importance. Compute weighted MAPE where high-revenue products dominate aggregate scores. Alternatively, use volume-weighted approaches. Always report both weighted and unweighted metrics—discrepancies reveal whether the model prioritizes correctly.
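A small sketch of the side-by-side reporting described above, assuming per-series MAPEs and revenue weights are already available; the divergence threshold of 5 percentage points is an illustrative choice, not a rule from this framework.

```python
def compare_aggregates(mapes, revenues, threshold_pp=5.0):
    """Report unweighted vs revenue-weighted MAPE and flag large divergence."""
    unweighted = sum(mapes) / len(mapes)
    weighted = sum(m * r for m, r in zip(mapes, revenues)) / sum(revenues)
    return {
        "unweighted": unweighted,
        "weighted": weighted,
        # A big gap means accuracy is distributed unevenly across
        # high- and low-revenue products.
        "divergent": abs(weighted - unweighted) > threshold_pp,
    }

# A high-revenue product forecast well pulls the weighted score down.
report = compare_aggregates(mapes=[10.0, 30.0], revenues=[90.0, 10.0])
```

Here the unweighted MAPE is 20% but the revenue-weighted MAPE is 12%, so the model is doing well where it matters most; the reverse gap would be the concerning case.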
Stratified Analysis Framework
Aggregate metrics hide important patterns. Stratify by: product lifecycle (new vs mature), demand pattern (smooth vs intermittent vs seasonal), volume tier (high/medium/low movers), and forecast horizon. Create performance matrices across these dimensions. This identifies systematic weaknesses—perhaps the model struggles with new product launches or long-horizon seasonal items specifically.
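One way to build such a performance matrix, sketched minimally: each evaluation record carries its stratification labels, and cells are averaged per (row, column) pair. The record fields (`lifecycle`, `horizon`, `mape`) are hypothetical names for this example.

```python
from collections import defaultdict

def performance_matrix(records, row_key, col_key, metric="mape"):
    """Average a metric within each (row, column) stratum, e.g. lifecycle x horizon."""
    cells = defaultdict(list)
    for r in records:
        cells[(r[row_key], r[col_key])].append(r[metric])
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}

records = [
    {"lifecycle": "new", "horizon": 1, "mape": 40.0},
    {"lifecycle": "new", "horizon": 1, "mape": 20.0},
    {"lifecycle": "mature", "horizon": 1, "mape": 10.0},
]
matrix = performance_matrix(records, row_key="lifecycle", col_key="horizon")
```

Reading across the matrix makes systematic weaknesses visible: a cell like `("new", 1)` averaging far above `("mature", 1)` points at new-product launches rather than a global accuracy problem.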
Automation: Implement anomaly detection on the metrics themselves. Alert when weekly MAPE increases by more than 10% in relative terms, or when individual segments show sudden degradation. This catches pipeline issues and model drift before business impact accumulates.
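The relative-increase alert can be as simple as comparing the latest weekly MAPE to the prior week per segment. A minimal sketch, assuming a dict of per-segment weekly MAPE histories (oldest first):

```python
def mape_alerts(history, threshold=0.10):
    """Flag segments whose latest weekly MAPE rose more than `threshold`
    relative to the prior week (0.10 = 10% relative increase)."""
    alerts = []
    for segment, weekly in history.items():
        if len(weekly) < 2:
            continue  # need two weeks to compare
        prev, curr = weekly[-2], weekly[-1]
        if prev > 0 and (curr - prev) / prev > threshold:
            alerts.append({"segment": segment, "prev": prev, "curr": curr})
    return alerts

alerts = mape_alerts({
    "electronics": [10.0, 12.0],  # +20% relative -> alert
    "grocery": [8.0, 8.2],        # +2.5% relative -> quiet
})
```

In production this check would run after each weekly metric refresh, with alerts routed to the team owning the affected segment.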
Baseline Tracking
Maintain baseline comparisons: naive forecasts, statistical baselines (moving average, exponential smoothing), and previous model versions. Report skill scores showing improvement over baselines. Dashboards should show current accuracy, baseline accuracy, and delta—contextualizing whether 15% MAPE is good (baseline was 25%) or concerning (baseline was 12%).
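A common way to express "improvement over baseline" is a MAPE-based skill score, 1 - model_error / baseline_error: positive means the model beats the baseline, negative means it loses to it. A minimal sketch of the dashboard row described above (the field names are illustrative):

```python
def skill_score(model_mape, baseline_mape):
    """Fractional improvement over baseline: 1.0 is perfect, 0.0 is no
    better than baseline, negative is worse than baseline."""
    return 1.0 - model_mape / baseline_mape

def dashboard_row(model_mape, baseline_mape):
    """Current accuracy, baseline accuracy, and their delta in one record."""
    return {
        "model_mape": model_mape,
        "baseline_mape": baseline_mape,
        "delta": model_mape - baseline_mape,
        "skill": skill_score(model_mape, baseline_mape),
    }

good = dashboard_row(model_mape=15.0, baseline_mape=25.0)  # beats baseline
bad = dashboard_row(model_mape=15.0, baseline_mape=12.0)   # loses to baseline
```

The two rows illustrate the point from the text: the same 15% MAPE yields a positive skill score against a 25% baseline and a negative one against a 12% baseline.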