
Production Evaluation Pipelines: Scale, Cost, and Operating Points

At scale, evaluation becomes an engineering challenge. A validation pass over 50,000 images with 300 predictions per image means matching 15 million predicted boxes against roughly 500,000 ground truths. Even with vectorized IoU computation taking only hundreds of microseconds per image, the full dataset can take minutes on a single machine, and large companies evaluate daily on tens of millions of frames, which requires distributed computation.

Consider a retail vision system processing 1080p frames at 15 frames per second on edge devices. It might maintain a held-out test set of 50,000 images with roughly 10 objects per image. The team reports COCO-style AP at [.5:.95], per-class AP, and per-size AP offline in the cloud, while at inference the device must deliver 90% precision at 80% recall for the top 100 SKUs at IoU 0.5, within 30 milliseconds per frame on a mobile Graphics Processing Unit (GPU). This dual-metric setup aligns offline model quality assessment with online serving requirements.

Evaluation directly drives threshold selection, because the precision-recall curve exposes the operating point. For Meta's ads safety filter, the team might pick the lowest confidence threshold achieving 98% precision for harmful content, then measure the resulting recall at 60 to 70%. For Amazon warehouse counting, they target 95% recall and accept 70% precision, using downstream spatial logic to suppress duplicates. Thresholds also vary per class because score distributions differ: a safety-critical pedestrian detector might use a threshold of 0.3, while a package label detector uses 0.7.

Cost management matters. Cloud batch evaluation can process 1,000 images per second per node, but running daily validation on 10 million frames still consumes a real compute budget. Real-time edge detectors must balance latency and accuracy: high-end mobile Neural Processing Units (NPUs) run models in 10 to 20 milliseconds per frame but often achieve AP at [.5:.95] of only 0.35 to 0.45, compared to 0.55+ for cloud models. Teams at Google and Tesla therefore track both detector AP and end-task metrics like retrieval quality or navigation success, to ensure AP gains translate to user outcomes.
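To make the matching step concrete, here is a minimal vectorized IoU sketch in NumPy. The function name `pairwise_iou` and the dummy box generation are illustrative, not taken from a specific library; the point is that one image's 300 predictions against ~10 ground truths produce a (300, 10) IoU matrix, and this per-image work is what gets sharded across workers in a distributed evaluation.

```python
import numpy as np

def pairwise_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """IoU between every predicted box and every ground-truth box.

    pred: (P, 4) array of [x1, y1, x2, y2]
    gt:   (G, 4) array of [x1, y1, x2, y2]
    Returns a (P, G) IoU matrix.
    """
    # Broadcast to (P, G): corners of the intersection rectangles
    x1 = np.maximum(pred[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(pred[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(pred[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(pred[:, None, 3], gt[None, :, 3])

    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_pred[:, None] + area_gt[None, :] - inter
    return inter / np.clip(union, 1e-9, None)

# One image at the scale described above: 300 predictions vs ~10 ground truths (dummy boxes).
rng = np.random.default_rng(0)
p_xy = rng.uniform(0, 960, (300, 2)); p_wh = rng.uniform(10, 120, (300, 2))
g_xy = rng.uniform(0, 960, (10, 2));  g_wh = rng.uniform(10, 120, (10, 2))
pred_boxes = np.hstack([p_xy, p_xy + p_wh])
gt_boxes = np.hstack([g_xy, g_xy + g_wh])
print(pairwise_iou(pred_boxes, gt_boxes).shape)  # (300, 10); repeat per image, shard images across workers
```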
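Operating-point selection can be sketched the same way. Assuming the matcher has already labeled each prediction as a true or false positive at the chosen IoU (the helper name and arguments below are hypothetical), the search finds the lowest confidence threshold that still meets a precision target and reports the recall it buys, with recall computed against the total ground-truth count so that missed objects count.

```python
import numpy as np

def pick_operating_point(scores, is_tp, num_gt, min_precision=0.98):
    """Lowest confidence threshold whose precision still meets the target.

    scores: (N,) confidence per prediction
    is_tp:  (N,) 1 if the prediction matched a ground truth at the chosen IoU, else 0
    num_gt: total ground-truth objects (denominator for recall, so misses count)
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                  # descending confidence
    tp_flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    meets = precision >= min_precision
    if not meets.any():
        return None                              # precision target unreachable on this set
    k = np.nonzero(meets)[0].max()               # keep as many predictions as possible = lowest threshold
    return float(scores[order][k]), float(precision[k]), float(recall[k])

# Per-class thresholds come from running the same search on each class's
# predictions and ground-truth count, since score distributions differ by class.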
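The cost side is back-of-envelope arithmetic using the figures quoted above (10 million frames per day at an assumed 1,000 images per second per node); the result shows why teams shard the job across nodes when turnaround time matters.

```python
import math

frames_per_day = 10_000_000
images_per_sec_per_node = 1_000                            # assumed batch throughput

node_seconds = frames_per_day / images_per_sec_per_node   # 10,000 node-seconds
node_hours = node_seconds / 3600                           # ~2.8 node-hours per daily run
nodes_for_15min = math.ceil(node_seconds / (15 * 60))      # ~12 nodes for a 15-minute turnaround

print(f"{node_hours:.1f} node-hours/day; {nodes_for_15min} nodes to finish in 15 minutes")
```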
💡 Key Takeaways
Evaluation at scale on 50,000 images with 300 predictions each involves matching 15 million boxes against 500,000 ground truths, taking minutes even with vectorized computation
Production systems maintain dual metrics: offline AP at [.5:.95] for model quality assessment (0.55+ for state-of-the-art models) and online precision/recall at fixed thresholds for serving (e.g., 90% precision at 80% recall)
Operating point selection uses the precision-recall curve to find thresholds meeting service-level objectives, varying per class based on score distributions and business impact
Edge inference trades accuracy for latency: mobile NPUs achieve 10 to 20ms per frame with AP at [.5:.95] of 0.35 to 0.45, while cloud models reach 0.55+ but run slower
Large companies evaluate daily on tens of millions of frames, requiring distributed computation at thousands of images per second per node and careful cost management
📌 Examples
Tesla autonomous driving: Runs offline campaigns on millions of video clips to measure long-tail recall for vulnerable road users under occlusion, tracking precision at 95%+ recall operating points for safety before deployment
Google Cloud Vision API: Processes 1,000 images per second per node for batch evaluation, reporting COCO AP at [.5:.95] above 0.55 while also tracking per-class AP to diagnose failures on rare categories
Amazon warehouse counting system: Targets 95% recall at 70% precision at IoU 0.5, then applies spatial clustering downstream to remove duplicates within 10 cm, reducing false positives by 40% post-detection