What Are the Key Failure Modes in Forecast Evaluation?
Evaluation on Wrong Distribution
Models evaluated on artificially balanced data or filtered subsets show inflated accuracy. Evaluation must use the same distribution the model will encounter in production. If 20% of products have sparse history, include them in evaluation even if they drag down metrics. Excluding hard cases creates false confidence.
Warning: Random train-test splits leak temporal information. A model that sees December test data during training implicitly learns December patterns. Always use time-based splits: train on past, test on future.
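A time-based split is a one-line filter once the data carries a timestamp. A minimal sketch, assuming a pandas DataFrame with hypothetical `date` and `sales` columns and an arbitrary cutoff date:

```python
import pandas as pd

# Hypothetical daily history; column names and dates are illustrative.
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=730, freq="D"),
    "sales": range(730),
})

# Time-based split: train strictly on the past, test strictly on the future.
cutoff = pd.Timestamp("2023-07-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]

# Guard against temporal leakage: no training row may postdate a test row.
assert train["date"].max() < test["date"].min()
```

In contrast, a random split (e.g. `sklearn.model_selection.train_test_split`) would scatter December rows across both sets, leaking seasonal patterns into training.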
Metric-Model Mismatch
Models optimized for RMSE may show poor MAPE because RMSE weights errors on an absolute scale, while MAPE weights them relative to the actual value. Models optimized for MAPE may in turn underperform on business metrics that care about absolute error. Align the training objective with the evaluation metric, or accept the mismatch and tune hyperparameters on the metric that matters.
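The divergence is easy to demonstrate on a mixed-scale series. A toy sketch with fabricated numbers: one high-volume product and three low-volume ones, and two hypothetical forecasts that disagree on which metric they win:

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error: penalizes absolute deviations."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error: penalizes relative deviations."""
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# One large product, three small ones (illustrative values).
actual = np.array([1000.0, 10.0, 10.0, 10.0])

# Forecast A: small absolute error on the big item, 50% error on each small item.
fc_a = np.array([990.0, 15.0, 15.0, 15.0])
# Forecast B: larger absolute error on the big item, exact on the small items.
fc_b = np.array([950.0, 10.0, 10.0, 10.0])

# A wins on RMSE; B wins on MAPE.
assert rmse(actual, fc_a) < rmse(actual, fc_b)
assert mape(actual, fc_b) < mape(actual, fc_a)
```

The same two forecasts rank in opposite orders depending on the metric, which is why the metric choice has to precede model selection.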
Ignoring Uncertainty
Point forecast metrics (MAPE, RMSE) do not evaluate prediction intervals. A model may have good point accuracy but produce intervals that are too narrow (overconfident) or too wide (uninformative). Evaluate interval coverage: do 90% prediction intervals actually contain 90% of actuals? Calibration matters for decision-making under uncertainty.
Coverage Test: For each confidence level (50%, 80%, 90%), compute what percentage of actuals fall within the predicted interval. Well-calibrated intervals show actual coverage matching stated confidence.
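The coverage test above reduces to counting how often the actual lands inside the interval. A minimal sketch, using synthetic Gaussian actuals and intervals from a hypothetical well-calibrated model (the z-scores 0.674, 1.282, and 1.645 correspond to the 50%, 80%, and 90% central intervals of a normal distribution):

```python
import numpy as np

def interval_coverage(actual, lower, upper):
    """Fraction of actuals that fall inside [lower, upper]."""
    return np.mean((actual >= lower) & (actual <= upper))

rng = np.random.default_rng(0)
actual = rng.normal(loc=100, scale=10, size=1000)  # synthetic actuals

# z-scores for central intervals of a standard normal.
z_by_level = {0.50: 0.674, 0.80: 1.282, 0.90: 1.645}

for level, z in z_by_level.items():
    cov = interval_coverage(actual, 100 - z * 10, 100 + z * 10)
    # For a calibrated model, empirical coverage tracks the stated level.
    print(f"{level:.0%} interval: {cov:.1%} empirical coverage")
```

An overconfident model shows empirical coverage well below the stated level (e.g. 75% of actuals inside its "90%" intervals); an uninformative one shows coverage near 100% at every level.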
Survivorship Bias
Products that fail quickly disappear from evaluation datasets. Remaining products are inherently more predictable (they survived). This biases accuracy metrics upward. Include discontinued products in evaluation, or at least acknowledge the bias when reporting metrics.
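The size of the bias can be quantified by scoring the full catalog, discontinued products included, alongside the survivor-only subset. A toy sketch with fabricated per-product errors (all numbers illustrative):

```python
import numpy as np

# Hypothetical absolute percentage errors per product.
surviving_ape = np.array([0.05, 0.08, 0.06, 0.07])   # stable, predictable
discontinued_ape = np.array([0.40, 0.55])            # erratic before delisting

mape_survivors = surviving_ape.mean()
mape_full = np.concatenate([surviving_ape, discontinued_ape]).mean()

# Survivor-only evaluation understates error across the catalog.
assert mape_survivors < mape_full
```

Reporting both numbers, or at least the survivor-only caveat, keeps the headline accuracy honest.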