Time Series Forecasting: Model Evaluation (MAPE, RMSE, Forecast Bias)

What Are the Key Failure Modes in Forecast Evaluation?

Forecast evaluation fails in predictable ways that can invalidate your metrics, hide regressions, or incentivize harmful model behavior. Understanding these failure modes is essential for building robust production systems.

Zero and near-zero actuals break percentage-based metrics catastrophically. MAPE becomes infinite when the actual equals zero, which is common in intermittent demand for spare parts or long-tail SKUs. Even near-zero actuals inflate MAPE dramatically: if the actual is 1 and the forecast is 5, that is a 400% error. A small set of such points can dominate the mean and make MAPE meaningless. Solution: use WAPE at aggregate levels (sum of absolute errors divided by sum of actuals) or switch to Mean Absolute Scaled Error (MASE), which scales errors by the in-sample naive forecast's performance.

Aggregation bias creates Simpson's paradox, where overall metrics improve while business-critical segments degrade. Averaging MAPE across series without volume weighting lets improvements on irrelevant long-tail SKUs mask regressions on high-revenue A movers. A team might celebrate a 3-point MAPE improvement while WAPE on revenue-driving categories worsened by 2 points. Always compute weighted aggregate metrics and publish stratified cohort reports so trade-offs are visible.

Metric gaming emerges when teams optimize directly for a single metric. One retail team improved MAPE from 22% to 18% by implementing a policy that clipped all forecasts above the 90th percentile of historical demand. MAPE improved because conservative forecasts reduced over-prediction penalties, but fill rate on A movers dropped 3 percentage points due to increased stockouts. The cost to revenue was far larger than any efficiency gain from lower MAPE. Prevention requires metric bundles as promotion gates and publishing multiple complementary metrics.

Cancellation in bias hides dispersion completely. Positive and negative errors cancel algebraically, so a portfolio with 50 over-forecasts of +100 units and 50 under-forecasts of -100 units reports zero bias despite 10,000 units of total error and catastrophic service levels. Bias must always be paired with MAE, RMSE, or WAPE to detect dispersion.

Outliers and sensor errors dominate quadratic metrics. RMSE is extremely sensitive: 0.1% of trip Estimated Time of Arrival (ETA) predictions with 20-minute errors can raise aggregate RMSE by several seconds across millions of trips. A retail forecaster saw RMSE spike from 45 to 78 units after a sensor malfunction mislabeled 200 SKU-days with 10 times actual demand. Only 0.1% of the data was corrupted, but it moved the metric by 73%. Solution: investigate outliers separately and consider winsorizing errors at the 99.5th percentile for diagnostics, but keep uncapped versions for final reporting. The short sketches below illustrate these failure modes numerically.
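A minimal sketch (Python with NumPy, illustrative numbers only) of the zero-actual failure mode: on an intermittent-demand series, MAPE divides by zero while WAPE and MASE stay finite and interpretable.

```python
import numpy as np

def mape(actual, forecast):
    # Undefined when any actual is zero; near-zero actuals explode the ratio.
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def wape(actual, forecast):
    # Weighted absolute percentage error: sum of absolute errors over sum of actuals.
    return np.abs(actual - forecast).sum() / actual.sum() * 100

def mase(actual, forecast, insample):
    # Scale the out-of-sample MAE by the in-sample one-step naive forecast MAE.
    naive_mae = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(actual - forecast)) / naive_mae

# Intermittent-demand series: mostly zeros with occasional small spikes.
history  = np.array([0, 0, 3, 0, 0, 1, 0, 4, 0, 0], dtype=float)
actual   = np.array([0, 1, 0, 5], dtype=float)
forecast = np.array([1, 5, 1, 4], dtype=float)

print(wape(actual, forecast))           # ~117%: large but finite and comparable
print(mase(actual, forecast, history))  # ~0.98: roughly on par with the naive baseline
# mape(actual, forecast) divides by zero on the zero-actual weeks; even after
# dropping them, the actual=1 / forecast=5 point alone contributes a 400% term.
```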
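A sketch of the aggregation trap, using hypothetical SKU names and volumes: the unweighted mean of per-SKU percentage errors improves while the volume-weighted error, dominated by high-revenue A movers, gets worse.

```python
import numpy as np

# (weekly volume, old abs % error, new abs % error) -- all values hypothetical
skus = {
    "A-mover-1":  (10_000, 0.10, 0.12),  # high-revenue SKUs regress slightly
    "A-mover-2":  ( 8_000, 0.12, 0.14),
    "tail-sku-1": (    50, 0.60, 0.40),  # long-tail SKUs improve a lot
    "tail-sku-2": (    40, 0.55, 0.35),
    "tail-sku-3": (    30, 0.50, 0.30),
}

vol = np.array([v[0] for v in skus.values()], dtype=float)
old = np.array([v[1] for v in skus.values()])
new = np.array([v[2] for v in skus.values()])

# Unweighted average: dominated by the tail, so the "improvement" looks real.
print("unweighted:", old.mean(), "->", new.mean())        # 0.374 -> 0.262
# Volume-weighted average: dominated by A movers, so it correctly looks worse.
print("weighted:  ", np.average(old, weights=vol), "->",
      np.average(new, weights=vol))                       # ~0.112 -> ~0.130
```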
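A sketch of bias cancellation: equal and opposite errors average to exactly zero bias, while MAE and total absolute error expose the dispersion.

```python
import numpy as np

# 50 series over-forecast by 100 units, 50 under-forecast by 100 units.
errors = np.concatenate([np.full(50, 100.0), np.full(50, -100.0)])

bias = errors.mean()                    # 0.0 -- signed errors cancel exactly
mae = np.abs(errors).mean()             # 100.0 units of error per series
total_abs_error = np.abs(errors).sum()  # 10,000 units of misplaced inventory

print(bias, mae, total_abs_error)
```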
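A sketch of RMSE's outlier sensitivity, with illustrative error magnitudes: corrupting roughly 0.1% of points inflates RMSE sharply, and a winsorized variant (absolute errors capped at the 99.5th percentile) gives a diagnostic view while the uncapped value is kept for final reporting.

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(0, 45, size=200_000)                 # typical errors, RMSE ~45 units

corrupted = errors.copy()
bad = rng.choice(errors.size, size=200, replace=False)   # ~0.1% of SKU-days
corrupted[bad] += 1_500                                  # sensor reports ~10x actual demand

def rmse(e):
    return np.sqrt(np.mean(e ** 2))

def winsorized_rmse(e, q=0.995):
    cap = np.quantile(np.abs(e), q)                      # cap |error| at the 99.5th percentile
    return rmse(np.clip(e, -cap, cap))

print(rmse(errors))                # ~45
print(rmse(corrupted))             # jumps sharply even though only 0.1% of rows changed
print(winsorized_rmse(corrupted))  # diagnostic view; keep the uncapped RMSE for reporting
```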
💡 Key Takeaways
Zero actuals make MAPE infinite and near-zero actuals inflate it dramatically: an actual of 1 with a forecast of 5 creates a 400% error that dominates aggregate means
Aggregation bias creates Simpson's paradox: unweighted MAPE can improve 3 points while revenue-driving A movers degrade 2 points if the long tail dominates the average
Metric gaming from single-metric optimization: clipping forecasts improved MAPE from 22% to 18% but dropped fill rate 3 percentage points, costing millions in stockouts
Bias cancellation hides dispersion: 50 over-forecasts of +100 and 50 under-forecasts of -100 yield zero bias but 10,000 units of total error and terrible service
Outlier sensitivity in RMSE: 0.1% of ETA predictions with 20-minute errors raise aggregate RMSE by several seconds; a sensor malfunction on 0.1% of data moved RMSE by 73%
Distribution shift during promotions or regime changes collapses historical baselines, causing false-positive alerts if thresholds are unified instead of segmented by event type
📌 Examples
Intermittent demand failure: Spare parts with many zero-actual weeks produce 400%+ MAPE on any non-zero forecast, making the metric useless for that category
Retail gaming incident: A forecast clipping policy improved MAPE by 4 points but caused a 3-percentage-point fill-rate drop on A movers, worth millions in lost sales
Sensor error impact: Mislabeling 200 SKU-days (0.1% of 200K total) with 10x demand moved RMSE from 45 to 78 units; the investigation traced it to a warehouse scanner firmware bug
Cancellation masking: A distribution center portfolio showed zero bias for 6 months while chronic under-forecasting on perishables and over-forecasting on durables both created waste