Time Series Forecasting · Model Evaluation (MAPE, RMSE, Forecast Bias)

How to Implement Forecast Evaluation at Scale

Forecast Evaluation at Scale: A production framework for computing and aggregating forecast accuracy metrics across millions of time series with varying characteristics, hierarchies, and business importance levels.

Distributed Metric Computation

Evaluating millions of forecasts requires distributed processing. Partition time series by business unit or product category. Use MapReduce patterns: the map phase computes per-series metrics (MAPE, RMSE, bias), and the reduce phase aggregates them into hierarchical summaries. Process incrementally as new actuals arrive rather than recomputing everything in batch. Cache intermediate results at each hierarchy level for sub-minute dashboard latency.
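A minimal sketch of the two phases, assuming hypothetical `series_metrics` and `reduce_group` helpers that operate on aligned lists of actuals and forecasts (in production these would run under a distributed engine such as Spark, but the logic is the same):

```python
from statistics import mean

def series_metrics(series_id, actuals, forecasts):
    """Map phase: per-series MAPE, RMSE, and forecast bias (names are illustrative)."""
    errors = [f - a for a, f in zip(actuals, forecasts)]
    mape = mean(abs(e) / abs(a) for a, e in zip(actuals, errors) if a != 0)
    rmse = mean(e ** 2 for e in errors) ** 0.5
    bias = mean(errors)  # positive = over-forecasting, negative = under-forecasting
    return series_id, {"mape": mape, "rmse": rmse, "bias": bias}

def reduce_group(metrics_list):
    """Reduce phase: aggregate per-series metrics into one group-level summary."""
    return {
        "mape": mean(m["mape"] for m in metrics_list),
        "rmse": mean(m["rmse"] for m in metrics_list),
        "bias": mean(m["bias"] for m in metrics_list),
        "series_count": len(metrics_list),
    }
```

Because `reduce_group` only consumes the small per-series dicts, the heavy map phase can run partitioned by business unit and the reduce phase repeated once per hierarchy level.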

Weighted Aggregation Strategies

Simple averaging misleads when importance varies dramatically. Weight metrics by business impact: revenue contribution, margin, or strategic importance. Compute weighted MAPE where high-revenue products dominate aggregate scores. Alternatively, use volume-weighted approaches. Always report both weighted and unweighted metrics—discrepancies reveal whether the model prioritizes correctly.
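The weighted/unweighted contrast above might be sketched as follows, with `weights` standing in for revenue contribution (the field names are assumptions for illustration, not a prescribed API):

```python
def weighted_mape(per_series_mape, weights):
    """Aggregate MAPE where each series is weighted by business impact
    (e.g. revenue); high-weight products dominate the score."""
    total_weight = sum(weights[s] for s in per_series_mape)
    return sum(per_series_mape[s] * weights[s] for s in per_series_mape) / total_weight

def unweighted_mape(per_series_mape):
    """Simple average across series; every series counts equally."""
    return sum(per_series_mape.values()) / len(per_series_mape)
```

If the high-revenue series forecast well but the long tail does not, `weighted_mape` will look much better than `unweighted_mape`; reporting both makes that discrepancy visible.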

Stratified Analysis Framework

Aggregate metrics hide important patterns. Stratify by: product lifecycle (new vs mature), demand pattern (smooth vs intermittent vs seasonal), volume tier (high/medium/low movers), and forecast horizon. Create performance matrices across these dimensions. This identifies systematic weaknesses—perhaps the model struggles with new product launches or long-horizon seasonal items specifically.
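One possible way to build such a performance matrix, assuming per-series records already tagged with stratification keys (`demand_pattern`, `volume_tier` and the record layout are hypothetical):

```python
from collections import defaultdict
from statistics import mean

def performance_matrix(records):
    """Group per-series MAPE into cells keyed by (demand_pattern, volume_tier).
    Each record is a dict with hypothetical keys for this sketch."""
    cells = defaultdict(list)
    for r in records:
        cells[(r["demand_pattern"], r["volume_tier"])].append(r["mape"])
    # One average MAPE per stratum; weak cells reveal systematic model gaps
    return {cell: round(mean(v), 4) for cell, v in cells.items()}
```

The same grouping extends naturally to other axes such as product lifecycle or forecast horizon by adding keys to the cell tuple.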

Automation: Implement anomaly detection on metrics. Alert when weekly MAPE increases >10% relative, or when segments show sudden degradation. This catches pipeline issues and drift before business impact accumulates.
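The relative-increase alert could look something like this sketch, where `current` and `previous` map segment names to weekly MAPE (the names and dict layout are assumptions):

```python
def mape_alerts(current, previous, threshold=0.10):
    """Flag segments whose weekly MAPE rose by more than `threshold`
    relative to the previous week (default 10%, per the alerting rule)."""
    alerts = []
    for segment, cur in current.items():
        prev = previous.get(segment)
        if prev and (cur - prev) / prev > threshold:
            alerts.append((segment, round((cur - prev) / prev, 3)))
    return alerts
```

Running this weekly per segment catches both sudden pipeline breakage and gradual drift before the aggregate numbers move.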

Baseline Tracking

Maintain baseline comparisons: naive forecasts, statistical baselines (moving average, exponential smoothing), and previous model versions. Report skill scores showing improvement over baselines. Dashboards should show current accuracy, baseline accuracy, and delta—contextualizing whether 15% MAPE is good (baseline was 25%) or concerning (baseline was 12%).
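The skill-score framing above reduces to a one-line formula; the example below replays the 15% MAPE scenario from the text against both the 25% and 12% baselines:

```python
def skill_score(model_mape, baseline_mape):
    """Fractional improvement over a baseline: positive means the model
    beats the baseline, negative means it underperforms it."""
    return 1.0 - model_mape / baseline_mape
```

With a 25% baseline, 15% MAPE yields a skill score of +0.40 (good); against a 12% baseline it yields -0.25 (concerning), exactly the contrast the dashboard delta is meant to surface.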

💡 Key Takeaways
Distributed metric computation with MapReduce for millions of series
Weighted aggregation by business impact (revenue, volume, strategic importance)
Stratified analysis revealing patterns hidden in aggregate metrics
📌 Interview Tips
1. Partition by business unit with incremental processing as actuals arrive
2. Automated alerts when weekly MAPE increases >10% relative