Evaluation Failure Modes and Metric Gaming Risks
Evaluation metrics can mislead you if you do not understand their failure modes. Models can score well on benchmarks while failing badly in production, and teams can inadvertently game metrics without improving real performance.
🎭 METRIC GAMING RISKS
Benchmark overfitting: When you repeatedly evaluate on the same test set, your improvements start to fit that specific data. The model learns the quirks of your benchmark rather than general detection ability. Solution: Hold out a final test set that you never touch during development.
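One way to enforce that discipline is to carve off the hold-out set exactly once, with a fixed seed, before development begins. A minimal sketch (the image IDs are hypothetical):

```python
import random

def split_holdout(image_ids, holdout_frac=0.2, seed=42):
    """Set aside a final test set that is never used during development."""
    ids = sorted(image_ids)            # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = int(len(ids) * holdout_frac)
    holdout = set(ids[:n_holdout])     # touch only once, at the very end
    dev = set(ids[n_holdout:])         # iterate freely on this split
    return dev, holdout

dev, holdout = split_holdout([f"img_{i:04d}" for i in range(1000)])
print(len(dev), len(holdout))  # 800 200
```

Because the seed is fixed, the split is reproducible, and the hold-out set stays identical across every experiment.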
Class weighting manipulation: mAP weights all classes equally, but your application might care more about some classes. A model optimized for mAP might sacrifice rare-but-critical classes for easy gains on common classes. Solution: Report per-class AP and define importance-weighted metrics.
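A sketch of an importance-weighted alternative to plain mAP, assuming you already have per-class AP values. The class names and weights here are purely illustrative:

```python
def weighted_map(per_class_ap, class_weights):
    """Importance-weighted mean AP; weights need not be uniform."""
    total_w = sum(class_weights[c] for c in per_class_ap)
    return sum(per_class_ap[c] * class_weights[c] for c in per_class_ap) / total_w

per_class_ap = {"car": 0.80, "pedestrian": 0.45, "debris": 0.30}

# Uniform weights reproduce ordinary mAP; safety weights expose the
# weakness on rare-but-critical classes.
uniform = weighted_map(per_class_ap, {c: 1.0 for c in per_class_ap})
safety = weighted_map(per_class_ap, {"car": 1.0, "pedestrian": 5.0, "debris": 3.0})
print(round(uniform, 3), round(safety, 3))  # 0.517 0.439
```

The gap between the two numbers is the point: a model that looks fine under uniform weighting can be well below acceptable once the classes you actually care about count for more.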
Threshold tuning on test data: Choosing your confidence threshold by looking at test set performance inflates your numbers. The threshold that works best on your test set might not generalize. Solution: Use a separate validation set for threshold selection.
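A minimal sketch of threshold selection done on validation detections rather than test detections. It picks the candidate threshold that maximizes F1; the toy scores and true-positive flags are invented for illustration:

```python
def pick_threshold(scores, is_tp, n_gt, candidates):
    """Select a confidence threshold on VALIDATION data.
    scores: confidence per detection; is_tp: 1 if it matched a GT box;
    n_gt: total ground-truth objects in the validation set."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        kept = [tp for s, tp in zip(scores, is_tp) if s >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / n_gt
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
is_tp  = [1,    1,    1,    0,    1,    0]
t, f1 = pick_threshold(scores, is_tp, n_gt=5, candidates=[0.2, 0.5, 0.7])
print(t, round(f1, 2))  # 0.7 0.75
```

The selected threshold is then applied, unchanged, to the test set; the test set never participates in the choice.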
⚠️ EVALUATION BLIND SPOTS
Distribution shift: Your test set might not represent production data. Benchmark images are often curated, well-lit, and clearly composed. Production images include motion blur, occlusion, unusual angles, and edge cases nobody thought to include in the test set.
Temporal correlation: If training and test images come from the same video sequences, the model might recognize backgrounds rather than objects. Always split by video or recording session, not by individual frames.
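A sketch of a group-aware split, assuming each frame carries a video (or recording-session) identifier; the `(frame_id, video_id)` schema is hypothetical:

```python
import random
from collections import defaultdict

def split_by_video(frames, test_frac=0.2, seed=0):
    """Split frames by source video so no video spans both splits.
    frames: iterable of (frame_id, video_id) pairs."""
    by_video = defaultdict(list)
    for frame_id, video_id in frames:
        by_video[video_id].append(frame_id)
    videos = sorted(by_video)          # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_test = max(1, int(len(videos) * test_frac))
    test = [f for v in videos[:n_test] for f in by_video[v]]
    train = [f for v in videos[n_test:] for f in by_video[v]]
    return train, test
```

Splitting at the video level means the splits may not hit the target fraction exactly at the frame level, but that imbalance is far cheaper than the leakage a frame-level split introduces.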
Label noise: Ground truth is not actually ground truth; it is human annotation, complete with errors. Inter-annotator disagreement of 5-10% is common for detection tasks. A model that disagrees with noisy labels might actually be correct.
🔍 HIDDEN FAILURE MODES
Small object collapse: mAP averages over all object sizes, but small objects are much harder to detect. High overall mAP can hide terrible small-object performance. Report AP by object size bucket.
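A sketch of COCO-style size bucketing (areas below 32² pixels count as "small", below 96² as "medium"), which you can use to group detections and ground truth before computing per-bucket AP:

```python
def size_bucket(box):
    """Assign a COCO-style size bucket from a (x1, y1, x2, y2) pixel box."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(size_bucket((0, 0, 20, 20)))    # small  (area 400)
print(size_bucket((0, 0, 50, 50)))    # medium (area 2500)
print(size_bucket((0, 0, 100, 100)))  # large  (area 10000)
```

Reporting AP separately per bucket makes a small-object collapse visible instead of letting it disappear into the overall average.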
Confidence miscalibration: A model might rank detections correctly (good mAP) while having poorly calibrated confidence scores. A detection reported at 0.9 confidence should be correct 90% of the time, but many models are overconfident.
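One common way to quantify this is Expected Calibration Error (ECE): bin detections by confidence and compare each bin's average confidence with its empirical precision. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: ten 0.9-confidence detections, nine correct.
print(expected_calibration_error([0.9] * 10, [1] * 9 + [0]))  # 0.0 (to float precision)
```

A miscalibrated but well-ranked model scores the same mAP as a calibrated one, so ECE (or a reliability diagram) belongs alongside mAP, not instead of it.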