
Evaluation Failure Modes and Metric Gaming Risks

Evaluation metrics can mislead if you don't understand their failure modes.

Small objects expose IoU's harshness. On a 16 by 16 pixel object, a shift of just a couple of pixels can be enough to drop IoU from around 0.7 to around 0.4, causing correct detections to be counted as false positives. Models with strong AP on large objects can collapse on small object AP, which is why COCO reports metrics separately by object size. A model with overall AP at [.5:.95] of 0.50 might have small object AP of only 0.30. The first sketch below shows how the same pixel shift barely moves IoU for a large box but pushes a small box below common matching thresholds.

Duplicate detections from weak Non Maximum Suppression (NMS) destroy precision. If one ground truth attracts three high confidence predictions, only the best one matches; the other two become false positives. This creates a steep precision drop in the higher recall region of the curve, making AP look worse even though localization is good. Conversely, overly aggressive NMS can suppress correct detections of nearby objects, tanking recall. The matching sketch below shows how the extra predictions turn into false positives.

Score miscalibration changes thresholded metrics but not necessarily AP. If your model outputs overconfident scores, a fixed confidence threshold admits far more detections than intended, so precision at that operating point collapses. AP, however, depends only on the ranking of scores, so rescaling them without changing their order leaves AP unchanged. A model can therefore post good AP yet perform poorly at the fixed threshold used in production if calibration is off, as the calibration sketch below demonstrates.

Dataset issues compound evaluation errors. Missing ground truth boxes cause correct predictions to count as false positives. In crowd scenes where annotators mark group boxes or skip individuals, AP severely understates performance. COCO flags crowd regions with iscrowd so that detections overlapping them are ignored during matching; if your dataset has no such mechanism, your metrics will be pessimistic.

Domain shift hits harder. A model with mAP of 0.55 on a curated validation set can drop to 0.35 in the wild due to nighttime conditions, motion blur, or occlusion. Long tail categories with few samples drag down macro mAP, making releases look worse despite head class improvements.

Always inspect precision recall curves, multiple AP variants (AP at 0.5, AP at 0.75, per class, per size), and failure case galleries to avoid metric gaming and hidden regressions. The COCOeval sketch at the end of this section produces most of that breakdown in one pass.
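To make the size sensitivity concrete, here is a minimal, self-contained sketch in plain Python. It applies the same corner shift to a 16 by 16 and a 128 by 128 box and prints the resulting IoU; the helper and the box sizes are illustrative, not a reference implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The same diagonal shift in pixels hurts a small box far more than a large one.
for size in (16, 128):
    gt = [0, 0, size, size]
    for shift in (2, 4):
        pred = [shift, shift, size + shift, size + shift]
        print(f"size={size:3d}px  shift={shift}px  IoU={iou(gt, pred):.2f}")
# size= 16px: shift=2px -> IoU ~0.62, shift=4px -> IoU ~0.39
# size=128px: shift=2px -> IoU ~0.94, shift=4px -> IoU ~0.88
```

A few pixels of localization error leave the large box comfortably above an IoU threshold of 0.5 while pushing the small box below it, which is exactly why size-bucketed AP is worth reporting.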
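The duplicate-detection failure follows directly from how greedy matching works: predictions are processed in descending score order and each ground truth can be claimed once. The sketch below reuses the iou helper above on a hypothetical setup where three confident predictions land on one object, yielding one true positive and two false positives.

```python
def match_detections(preds, gts, iou_thr=0.5):
    """Greedy VOC/COCO-style matching: highest-scored predictions first,
    each ground-truth box can be matched at most once."""
    matched = set()
    labels = []  # True = true positive, False = false positive
    for score, box in sorted(preds, key=lambda p: -p[0]):
        best_iou, best_gt = 0.0, None
        for gi, gt_box in enumerate(gts):
            if gi in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_gt = overlap, gi
        if best_gt is not None and best_iou >= iou_thr:
            matched.add(best_gt)
            labels.append(True)
        else:
            labels.append(False)  # duplicate or poorly localized -> false positive
    return labels

# One ground truth, three near-identical confident predictions (weak NMS):
gts = [[0, 0, 50, 50]]
preds = [(0.95, [0, 0, 50, 50]), (0.93, [1, 1, 51, 51]), (0.90, [2, 2, 52, 52])]
print(match_detections(preds, gts))  # [True, False, False] -> precision 1/3
```

Crowd regions slot into the same matcher: in COCO evaluation, a detection whose best overlap is with an iscrowd box is ignored rather than counted as a false positive.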
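The calibration point can be checked numerically. The sketch below, using NumPy and made-up scores and TP/FP labels, computes an interpolated AP and a precision at a fixed confidence threshold, then applies a monotonic rescaling that mimics an overconfident model: AP is identical, the thresholded precision is not.

```python
import numpy as np

def average_precision(scores, labels):
    """Interpolated AP from per-detection scores and TP/FP labels (sketch)."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(labels, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / tp.sum()
    precision = cum_tp / (cum_tp + cum_fp)
    ap, prev_r = 0.0, 0.0
    for i in range(len(recall)):
        ap += (recall[i] - prev_r) * precision[i:].max()  # interpolated precision
        prev_r = recall[i]
    return ap

def precision_at_threshold(scores, labels, thr):
    keep = np.asarray(scores) >= thr
    return np.asarray(labels, dtype=float)[keep].mean()

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1, 1, 0, 1, 0, 0])
overconfident = scores ** 0.25  # monotonic rescaling: same ranking, inflated scores

print(average_precision(scores, labels),            # ~0.92
      average_precision(overconfident, labels))     # identical: ranking unchanged
print(precision_at_threshold(scores, labels, 0.75),         # 1.00 (2 detections kept)
      precision_at_threshold(overconfident, labels, 0.75))  # 0.50 (all 6 kept)
```

Monotonic recalibration such as temperature scaling fixes the operating point without touching the ranking, which is why it moves thresholded precision but leaves AP alone.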
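For the final recommendation, the standard way to get the full breakdown on a COCO-format dataset is pycocotools' COCOeval, which reports AP at 0.5, AP at 0.75, and AP by object size in one pass. The file names below are placeholders for your own annotation and results files.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: a COCO-format ground-truth file and a detection results
# file ([{"image_id": ..., "category_id": ..., "bbox": [...], "score": ...}, ...]).
coco_gt = COCO("annotations/instances_val.json")
coco_dt = coco_gt.loadRes("detections_val.json")

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP@[.5:.95], AP@.5, AP@.75 and small/medium/large AP, plus AR
```

Per-class numbers can be pulled out of ev.eval['precision'] (indexed by IoU threshold, recall point, category, area range, and max detections) to spot long tail regressions that the macro average hides.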
💡 Key Takeaways
Small object AP can be 20 points lower than overall AP (0.30 vs 0.50) because IoU is harsh on small boxes where a 2 pixel shift causes large relative overlap changes
Weak NMS creates duplicate detections where one ground truth attracts multiple predictions, counting the extras as false positives and sharply cutting precision (three confident predictions on one object yield one true positive and two false positives) despite good localization
Score miscalibration affects metrics computed at a fixed confidence threshold but leaves AP unchanged, since AP depends only on ranking order, not absolute score values
Missing or incomplete ground truth labels cause correct predictions to be marked as false positives, understating AP by 10 to 20 points in crowd scenes or datasets with sparse annotation
Domain shift from curated validation to real world deployment can drop mAP from 0.55 to 0.35 due to nighttime conditions, motion blur, occlusion, or long tail events with insufficient training examples
📌 Examples
COCO small object performance: Models with overall AP@[.5:.95] of 0.50 often show small object AP of only 0.30, exposed by reporting separate metrics for objects under 32x32 pixels
Tesla nighttime detection drop: Validation mAP of 0.55 in daytime can fall to 0.35 at night due to domain shift, requiring separate night dataset evaluation and augmentation during training
Meta content moderation: Overconfident scores from a poorly calibrated model look fine in offline, ranking-based (AP-style) evaluation at 95% precision but reach only 80% precision in production at a fixed threshold, requiring temperature scaling recalibration