Evaluation Failure Modes and Metric Gaming Risks
Evaluation metrics can mislead you if you do not understand their failure modes. Models can score well on benchmarks while failing badly in production, and teams can inadvertently game metrics without improving real performance.
🎭 METRIC GAMING RISKS
Benchmark overfitting: When you repeatedly evaluate on the same test set, your improvements start to fit that specific data. The model learns the quirks of your benchmark rather than general detection ability. Solution: Hold out a final test set that you never touch during development.
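One way to enforce that discipline is to carve off the hold-out set exactly once, with a fixed seed, before development begins. A minimal sketch (the image IDs are hypothetical):

```python
import random

def split_holdout(image_ids, holdout_frac=0.2, seed=42):
    """Set aside a final test set that is never used during development."""
    ids = sorted(image_ids)            # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = int(len(ids) * holdout_frac)
    holdout = set(ids[:n_holdout])     # touch only once, at the very end
    dev = set(ids[n_holdout:])         # iterate freely on this split
    return dev, holdout

dev, holdout = split_holdout([f"img_{i:04d}" for i in range(1000)])
print(len(dev), len(holdout))  # 800 200
```

Because the seed is fixed, the split is reproducible, and the hold-out set stays identical across every experiment.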
Class weighting manipulation: mAP weights all classes equally, but your application might care more about some classes. A model optimized for mAP might sacrifice rare-but-critical classes for easy gains on common classes. Solution: Report per-class AP and define importance-weighted metrics.
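A sketch of an importance-weighted alternative to plain mAP, assuming you already have per-class AP values. The class names and weights here are purely illustrative:

```python
def weighted_map(per_class_ap, class_weights):
    """Importance-weighted mean AP; weights need not be uniform."""
    total_w = sum(class_weights[c] for c in per_class_ap)
    return sum(per_class_ap[c] * class_weights[c] for c in per_class_ap) / total_w

per_class_ap = {"car": 0.80, "pedestrian": 0.45, "debris": 0.30}

# Uniform weights reproduce ordinary mAP; safety weights expose the
# weakness on rare-but-critical classes.
uniform = weighted_map(per_class_ap, {c: 1.0 for c in per_class_ap})
safety = weighted_map(per_class_ap, {"car": 1.0, "pedestrian": 5.0, "debris": 3.0})
print(round(uniform, 3), round(safety, 3))  # 0.517 0.439
```

The gap between the two numbers is the point: a model that looks fine under uniform weighting can be well below acceptable once the classes you actually care about count for more.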
Threshold tuning on test data: Choosing your confidence threshold by looking at test set performance inflates your numbers. The threshold that works best on your test set might not generalize. Solution: Use a separate validation set for threshold selection.
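A minimal sketch of threshold selection done on validation detections rather than test detections. It picks the candidate threshold that maximizes F1; the toy scores and true-positive flags are invented for illustration:

```python
def pick_threshold(scores, is_tp, n_gt, candidates):
    """Select a confidence threshold on VALIDATION data.
    scores: confidence per detection; is_tp: 1 if it matched a GT box;
    n_gt: total ground-truth objects in the validation set."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        kept = [tp for s, tp in zip(scores, is_tp) if s >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / n_gt
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
is_tp  = [1,    1,    1,    0,    1,    0]
t, f1 = pick_threshold(scores, is_tp, n_gt=5, candidates=[0.2, 0.5, 0.7])
print(t, round(f1, 2))  # 0.7 0.75
```

The selected threshold is then applied, unchanged, to the test set; the test set never participates in the choice.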
⚠️ EVALUATION BLIND SPOTS
Distribution shift: Your test set might not represent production data. Benchmark images are often curated, well-lit, and clearly composed. Production images include motion blur, occlusion, unusual angles, and edge cases nobody thought to include in the test set.
Temporal correlation: If training and test images come from the same video sequences, the model might recognize backgrounds rather than objects. Always split by video or recording session, not by individual frames.
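A sketch of a group-aware split, assuming each frame carries a video (or recording-session) identifier; the `(frame_id, video_id)` schema is hypothetical:

```python
import random
from collections import defaultdict

def split_by_video(frames, test_frac=0.2, seed=0):
    """Split frames by source video so no video spans both splits.
    frames: iterable of (frame_id, video_id) pairs."""
    by_video = defaultdict(list)
    for frame_id, video_id in frames:
        by_video[video_id].append(frame_id)
    videos = sorted(by_video)          # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_test = max(1, int(len(videos) * test_frac))
    test = [f for v in videos[:n_test] for f in by_video[v]]
    train = [f for v in videos[n_test:] for f in by_video[v]]
    return train, test
```

Splitting at the video level means the splits may not hit the target fraction exactly at the frame level, but that imbalance is far cheaper than the leakage a frame-level split introduces.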
Label noise: Ground truth is not actually ground truth; it is human annotation, complete with errors. Inter-annotator disagreement of 5-10% is common for detection tasks. A model that disagrees with noisy labels might actually be correct.
🔍 HIDDEN FAILURE MODES
Small object collapse: mAP averages over all object sizes, but small objects are much harder to detect. High overall mAP can hide terrible small-object performance. Report AP by object size bucket.
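A sketch of COCO-style size bucketing (areas below 32² pixels count as "small", below 96² as "medium"), which you can use to group detections and ground truth before computing per-bucket AP:

```python
def size_bucket(box):
    """Assign a COCO-style size bucket from a (x1, y1, x2, y2) pixel box."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(size_bucket((0, 0, 20, 20)))    # small  (area 400)
print(size_bucket((0, 0, 50, 50)))    # medium (area 2500)
print(size_bucket((0, 0, 100, 100)))  # large  (area 10000)
```

Reporting AP separately per bucket makes a small-object collapse visible instead of letting it disappear into the overall average.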
Confidence miscalibration: A model might rank detections correctly (good mAP) while having poorly calibrated confidence scores. A detection reported at 0.9 confidence should be correct 90% of the time, but many models are overconfident.
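One common way to quantify this is Expected Calibration Error (ECE): bin detections by confidence and compare each bin's average confidence with its empirical precision. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: ten 0.9-confidence detections, nine correct.
print(expected_calibration_error([0.9] * 10, [1] * 9 + [0]))  # 0.0 (to float precision)
```

A miscalibrated but well-ranked model scores the same mAP as a calibrated one, so ECE (or a reliability diagram) belongs alongside mAP, not instead of it.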