How to Build a Production Metric Suite for Forecast Evaluation
Real forecasting systems never rely on a single metric because each captures different properties of error and optimizing for one can degrade others. A production metric suite typically combines a scale free metric like Weighted Absolute Percentage Error (WAPE), a scale dependent metric like RMSE or MAE, and a directional metric like bias. This triangulates accuracy, error severity, and systematic skew across heterogeneous series.
Consider an e-commerce retailer with 50 million SKU-location pairs forecasting 1 to 13 weeks ahead. Offline evaluation runs rolling origin backtests across the last 104 weeks: for each origin, the model trains on data up to that week, predicts the next 13 weeks, and metrics are computed per SKU and horizon. A single backtest fold produces 50 million × 13 = 650 million forecast points. A distributed job on a few hundred virtual CPUs (vCPUs) scans 200 to 400 GB of Parquet and aggregates per product class, warehouse, and region in 20 to 40 minutes. The team surfaces WAPE for business owners (intuitive, scale free), RMSE for engineers (captures large-error risk), and bias for supply chain operators (detects systematic skew).
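A minimal sketch of that rolling origin loop, assuming a toy schema with integer weeks and columns `sku`, `location`, `week`, `demand`, and a naive last-value baseline standing in for the production model (all names here are illustrative, not the retailer's actual pipeline):

```python
import pandas as pd

def naive_forecast(train: pd.DataFrame, origin: int, horizon: int) -> pd.DataFrame:
    # Toy baseline: carry the last observed demand forward for each series.
    last = train.sort_values("week").groupby(["sku", "location"]).demand.last().reset_index()
    frames = []
    for h in range(1, horizon + 1):
        f = last.copy()
        f["week"] = origin + h
        frames.append(f.rename(columns={"demand": "forecast"}))
    return pd.concat(frames)

def rolling_origin_backtest(history, origins, horizon=13, forecast_fn=naive_forecast):
    # For each origin week: train on data up to the origin, forecast the next
    # `horizon` weeks, and join predictions back to actuals for scoring.
    results = []
    for origin in origins:
        train = history[history.week <= origin]
        actuals = history[(history.week > origin) & (history.week <= origin + horizon)]
        preds = forecast_fn(train, origin, horizon)
        scored = actuals.merge(preds, on=["sku", "location", "week"])
        scored["origin"] = origin
        scored["h"] = scored.week - origin  # forecast horizon in weeks
        results.append(scored)
    return pd.concat(results)  # one row per forecast point; metrics aggregated downstream
```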
Weighting is critical to reflect business impact. WAPE computes sum of absolute errors divided by sum of actuals, naturally weighting by volume. This prevents small or intermittent series from dominating corporate scorecards while still exposing their accuracy to owning teams through stratified reports. For RMSE across heterogeneous series, either normalize by naive forecast error to get Root Mean Squared Scaled Error (RMSSE), or compute segment level RMSEs and report separately. Never average raw RMSE across series with different magnitudes.
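The three metric families reduce to a few lines of NumPy. A sketch with illustrative function names; WAPE and bias follow the definitions above, and RMSSE scales RMSE by the in-sample one-step naive error:

```python
import numpy as np

def wape(actual, forecast):
    # Sum of absolute errors over sum of actuals: volume-weighted by construction.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def forecast_bias(actual, forecast):
    # Signed relative bias: positive means systematic over-forecasting.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return (forecast - actual).sum() / actual.sum()

def rmsse(actual, forecast, train_actual):
    # RMSE scaled by the one-step naive forecast error on the training history,
    # making values comparable across series of different magnitudes.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    naive_mse = np.mean(np.diff(np.asarray(train_actual)) ** 2)
    return np.sqrt(np.mean((actual - forecast) ** 2) / naive_mse)
```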
Online monitoring tracks rolling one hour, 24 hour, and seven day windows for bias, WAPE, and RMSE. Alerts trigger when bias exceeds 3 to 5% for high volume cohorts or when WAPE increases by more than 30% relative to a 28 day baseline. Teams include change point detection to reduce alert fatigue during known events like promotions. When metric spikes correlate with feature drift or known distribution shifts, rollbacks or override rules can be applied within one to two hours.
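A hedged sketch of those alert rules for a single cohort and rolling window; the dictionary keys and the 4% bias threshold (inside the 3 to 5% band above) are assumptions for illustration:

```python
def evaluate_alerts(window, baseline_wape, high_volume_cohort, bias_threshold=0.04):
    # `window` holds rolling metrics for one cohort, e.g. the 24 hour window,
    # with `bias` and `wape` expressed as fractions.
    alerts = []
    if high_volume_cohort and abs(window["bias"]) > bias_threshold:
        alerts.append("bias_breach")                       # bias beyond the 3-5% band
    if window["wape"] > 1.30 * baseline_wape:
        alerts.append("wape_regression_vs_28d_baseline")   # >30% above 28 day baseline
    return alerts
```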
Model promotion requires passing a metric bundle, not just improving one number. For example: WAPE change on A movers below -2 percentage points, AND absolute bias under 2%, AND RMSE no more than 5% worse than baseline. This prevents gaming, such as clipping high forecasts to improve MAPE at the cost of stockouts, or optimizing RMSE at the expense of systematic under-forecasting. Governance dashboards publish cohort metrics so improvements on one segment at the expense of another become visible immediately.
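A minimal sketch of that gate, assuming candidate and baseline metrics for A movers are stored as fractions (so 0.02 corresponds to 2 percentage points); names and structure are illustrative:

```python
def passes_promotion_gate(candidate, baseline):
    # `candidate` and `baseline` are dicts of cohort-level metrics for A movers.
    wape_delta = candidate["wape"] - baseline["wape"]
    return (
        wape_delta < -0.02                                 # WAPE improves by more than 2 pp
        and abs(candidate["bias"]) < 0.02                  # absolute bias under 2%
        and candidate["rmse"] <= 1.05 * baseline["rmse"]   # RMSE at most 5% worse
    )
```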
💡 Key Takeaways
• Never use a single metric: combine scale free (WAPE), scale dependent (RMSE), and directional (bias) metrics to triangulate accuracy, error severity, and systematic skew
• Weight by business impact: WAPE divides the sum of absolute errors by the sum of actuals, naturally weighting high volume series while preventing the long tail from dominating
• Offline backtesting at scale: 50 million SKU locations × 13 horizons = 650 million points per fold; a distributed job scans 200 to 400 GB in 20 to 40 minutes on hundreds of vCPUs
• Online monitoring with multiple windows: track one hour, 24 hour, and seven day rolling metrics, with alerts when bias exceeds 3 to 5% or WAPE rises 30% versus the 28 day baseline
• Model promotion requires a metric bundle: WAPE improvement AND low bias AND no RMSE regression; this prevents gaming one metric at the expense of others, such as clipping forecasts to reduce MAPE but causing stockouts
• Stratify by cohort and horizon: report A/B/C movers separately and one week versus 13 week horizons, preventing Simpson's paradox where the aggregate improves but key segments degrade
📌 Examples
E-commerce with 50M SKU locations: Surfaces WAPE to executives, RMSE to data scientists, and bias to supply chain, with each metric serving a different stakeholder's decisions
Rolling origin backtest: Last 104 weeks split into 8 to 12 folds, each fold trains up to origin week and forecasts next 13 weeks, metrics computed per SKU and horizon
Metric gaming prevention: A team improved MAPE from 22% to 18% by clipping high forecasts, but fill rate dropped 3 points on A movers; the change would have failed a promotion gate requiring absolute bias under 2%
Alert correlation: Bias spike during promotion launch traced to feature drift in price elasticity, override rule applied within 90 minutes to revert to pre promotion coefficients