Choosing Metrics and Protocols for Your Task
Metric choice must align with downstream use. AP captures ranking quality across confidence thresholds, making it ideal for model comparison and research. But if you operate at a fixed threshold in production, the F1 score at that operating point is more actionable. For counting tasks, mean absolute error directly measures what you care about. For tracking applications, metrics like Identification F1 (IDF1) and Multiple Object Tracking Accuracy (MOTA) matter more than detection AP because they measure identity consistency over time.
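As a minimal sketch of the fixed-threshold side of this tradeoff, the helpers below compute precision, recall, and F1 at a single operating point from matched detection counts, plus mean absolute error for a counting task. The function names and inputs are illustrative, not from any particular library.

```python
def fixed_threshold_f1(tp, fp, fn):
    """Precision, recall, and F1 at one fixed confidence threshold.

    tp/fp/fn are counts of true positives, false positives, and false
    negatives after matching detections to ground truth at that threshold.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def counting_mae(predicted_counts, true_counts):
    """Mean absolute error on per-image object counts."""
    return sum(abs(p - t) for p, t in zip(predicted_counts, true_counts)) / len(true_counts)
```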
IoU threshold selection reflects task geometry. Use IoU 0.5 for indexing and retrieval where rough location suffices. Google's image search or Amazon's visual product search likely operate here. Use IoU 0.75 or higher when geometry feeds downstream control, such as robotic pick points or autonomous vehicle trajectory planning. Tesla's pedestrian detection probably requires 0.75+ because poor localization creates unsafe control decisions.
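For concreteness, this is the standard axis-aligned IoU computation that the 0.5 or 0.75 threshold is applied to when deciding whether a detection counts as a match; the box_iou name is a hypothetical helper, with boxes given as (x1, y1, x2, y2).

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is then a true positive only if box_iou against its matched ground-truth box clears the chosen threshold, which is why the same model can look strong at 0.5 and weak at 0.75.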
AP protocol choice affects comparability. PASCAL VOC style AP at IoU 0.5 yields higher numbers than COCO style AP averaged over IoU 0.50 to 0.95. Reporting only AP at 0.5 can hide localization regressions that harm downstream tasks; reporting only AP at [.5:.95] can undersell recall gains at IoU 0.5 that help indexing applications. Production teams report both, plus task specific operating points with concrete precision and recall targets.
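A sketch of the difference between the two protocols, assuming a hypothetical ap_at_iou mapping from IoU threshold to the AP your evaluation pipeline computes at that threshold: VOC-style reporting reads off a single entry, while COCO-style averages ten of them.

```python
def voc_style_ap(ap_at_iou):
    """PASCAL VOC style: AP at the single IoU threshold 0.5."""
    return ap_at_iou[0.5]


def coco_style_ap(ap_at_iou):
    """COCO style: mean AP over IoU thresholds 0.50 to 0.95 in 0.05 steps."""
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]  # 0.50 ... 0.95
    return sum(ap_at_iou[t] for t in thresholds) / len(thresholds)
```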
Class weighting matters when distribution is skewed. Macro mAP gives equal weight per class, exposing long tail weakness. If your business is dominated by a few classes, weighted AP by class frequency better reflects impact. Amazon's retail system might weight the top 100 SKUs heavily since they represent 80% of volume. Finally, couple metrics to end task outcomes. Google and Meta often publish both detector AP and user facing metrics like retrieval quality, click through rate, or navigation success to ensure that AP gains translate to real improvements.
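A minimal sketch of the two weighting schemes, assuming you already have per-class AP values and ground-truth instance counts as dictionaries keyed by class name (names are illustrative): a class that dominates volume moves the weighted average far more than the macro one.

```python
def macro_map(per_class_ap):
    """Macro mAP: every class counts equally, so long-tail weakness shows up."""
    return sum(per_class_ap.values()) / len(per_class_ap)


def frequency_weighted_map(per_class_ap, class_counts):
    """Weight each class's AP by its share of ground-truth instances."""
    total = sum(class_counts.values())
    return sum(ap * class_counts[c] / total for c, ap in per_class_ap.items())
```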
💡 Key Takeaways
• AP is ideal for model comparison across thresholds, but fixed threshold F1 score is more actionable for production systems running at a single operating point
• IoU 0.5 suits indexing and retrieval (image search, product recognition), while IoU 0.75+ is required for control tasks (robotics, autonomous driving) where localization feeds actuators
• PASCAL VOC AP at 0.5 produces scores 15 to 30 points higher than COCO AP at [.5:.95]; always report both plus task specific metrics to avoid hidden regressions
• Macro mAP treats all classes equally and exposes long tail weakness, while weighted mAP by class frequency aligns with business impact when distribution is skewed
• Couple detection metrics to end task outcomes: track both AP and user facing metrics like click through rate, retrieval precision, or navigation success to validate real world impact
📌 Examples
Amazon warehouse robotics: Uses IoU 0.75 for box detection feeding pick point calculation, where 5cm localization error causes 30% grasp failure rate, plus tracks mean absolute error on package counts
Google image search: Operates at IoU 0.5 for object based retrieval, reporting precision at 90% recall as primary production metric alongside offline AP at 0.5 for model selection
Meta ads safety: Reports both AP at [.5:.95] for model quality (0.55+) and precision at 98% for production serving, coupling with human review rate reduction (50% improvement) to measure business impact