
Production Evaluation Pipelines: Scale, Cost, and Operating Points

Research metrics like mAP tell you how well a model ranks detections, but production systems need different evaluation approaches. You must measure what actually matters to users and business outcomes at scale.

⚖️ RESEARCH VS PRODUCTION METRICS

Research metrics:
  • mAP across all classes
  • Fixed IoU thresholds
  • Offline benchmark scores
  • Fixed test datasets

Production metrics:
  • Critical class recall
  • False positive rate per hour
  • Latency percentiles
  • User-reported issues
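As a sketch of how the production-side metrics differ, here is one way to compute false positives per hour and critical-class recall from human-reviewed detection logs. The `Detection` record schema, the shift length, and the pedestrian counts are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass

# Hypothetical schema: each record is one model prediction that was
# later confirmed or rejected by human review.
@dataclass
class Detection:
    cls: str        # predicted class
    correct: bool   # did human review confirm it?

def production_metrics(detections, missed_criticals, critical_total, hours):
    """False positives per hour of operation, plus recall on a critical class."""
    false_positives = sum(1 for d in detections if not d.correct)
    fp_per_hour = false_positives / hours
    critical_recall = (critical_total - missed_criticals) / critical_total
    return fp_per_hour, critical_recall

# Example: 12 false alarms over a 6-hour shift; 2 of 40 pedestrians missed.
dets = [Detection("pedestrian", True)] * 30 + [Detection("shadow", False)] * 12
fp_rate, recall = production_metrics(dets, missed_criticals=2,
                                     critical_total=40, hours=6.0)
print(fp_rate, recall)  # 2.0 false positives/hour, 0.95 critical recall
```

Note that neither number appears in a standard mAP report, yet both map directly to user experience and safety.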

🎯 CHOOSING YOUR OPERATING POINT

The PR curve gives you many possible operating points. Choosing the right one depends on your failure costs:

  • High recall operating point: Accept more false positives to catch nearly everything. Use when misses are expensive (safety systems, security alerts).
  • High precision operating point: Only surface confident detections. Use when false alarms cause user frustration or wasted downstream processing.
  • Balanced operating point: Maximize F1, the harmonic mean of precision and recall. Use when both types of errors are roughly equally costly.
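One minimal way to pick an operating point is to sweep the PR curve and take the highest-precision threshold that still meets a recall floor. The helper names and the toy scores below are illustrative, not a standard API:

```python
def pr_points(scores, labels):
    """Precision/recall at each confidence threshold (labels: 1 = true object)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((scores[i], tp / (tp + fp), tp / total_pos))
    return points  # (threshold, precision, recall)

def pick_operating_point(points, min_recall):
    """Highest-precision threshold that still satisfies the recall floor."""
    feasible = [p for p in points if p[2] >= min_recall]
    return max(feasible, key=lambda p: p[1]) if feasible else None

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1,    1,   0,   1,   0,   1]
pts = pr_points(scores, labels)
print(pick_operating_point(pts, min_recall=0.75))  # (0.7, 0.75, 0.75)
```

Swapping the constraint (fix precision, maximize recall) gives the high-recall variant for safety-critical systems.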

📊 PRODUCTION EVALUATION PIPELINE

A robust evaluation system runs continuously, not just before deployment:

Ground truth collection: Sample production predictions for human review. Label a representative slice daily or weekly. Track labeler agreement to catch ambiguous cases.
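Tracking labeler agreement can be as simple as a pairwise agreement rate over doubly-labeled samples; items with low agreement are candidates for clearer labeling guidelines. This is a minimal sketch with made-up labels, not a full agreement statistic such as Cohen's kappa:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items where two labelers assigned the same class."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two labelers on the same five images (hypothetical labels).
a = ["car", "car", "truck", "car", "bus"]
b = ["car", "truck", "truck", "car", "bus"]
print(agreement_rate(a, b))  # 0.8
```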

Slice analysis: Break down metrics by scene type, lighting conditions, object size, and geographic region. A model with 90% overall mAP might have 60% mAP on night scenes - slice analysis reveals these gaps.
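The day/night gap above falls out of a simple group-by over per-image evaluation records. The record schema (a `lighting` field plus a per-image AP value) is assumed for illustration:

```python
from collections import defaultdict

def metric_by_slice(records, key, metric):
    """Average a per-image metric within each slice defined by `key`."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(metric(r))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Hypothetical per-image eval records (ap = average precision on that image).
records = [
    {"lighting": "day", "ap": 0.92}, {"lighting": "day", "ap": 0.88},
    {"lighting": "night", "ap": 0.61}, {"lighting": "night", "ap": 0.59},
]
print(metric_by_slice(records, "lighting", lambda r: r["ap"]))
# day ≈ 0.90, night ≈ 0.60 despite a respectable overall average
```

The same function extends to any slicing key: scene type, object size bucket, or region.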

Regression detection: Compare new model versions against baselines on the same evaluation set. Flag regressions in any critical slice before deployment.
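Per-slice regression checks then reduce to comparing two metric dictionaries with a tolerance, so a candidate that improves overall but regresses on one slice still gets flagged. The numbers below are invented for illustration:

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Slices where the candidate drops more than `tolerance` below baseline."""
    return [s for s in baseline
            if candidate.get(s, 0.0) < baseline[s] - tolerance]

# Per-slice mAP for the deployed baseline vs. a candidate model.
baseline  = {"day": 0.90, "night": 0.60, "rain": 0.70}
candidate = {"day": 0.91, "night": 0.52, "rain": 0.70}
print(find_regressions(baseline, candidate))  # ['night']
```

Gating deployment on an empty regression list for all critical slices is one way to enforce the "flag before deployment" rule.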

⚠️ Cost Reality: Human labeling is expensive. Budget 2-5 minutes per image for bounding boxes, longer for segmentation. Plan labeling costs into your evaluation pipeline design - you cannot evaluate what you cannot label.
💡 Key Takeaways

  • Production metrics differ from research metrics - measure false positives per hour, critical class recall, and latency
  • Operating point selection depends on failure costs: high recall for safety, high precision for user experience
  • Slice analysis reveals hidden weaknesses - overall mAP can mask poor performance on specific conditions
  • Continuous evaluation with fresh ground truth catches drift that static benchmarks miss
📌 Interview Tips

  1. When asked about metric choice, explain the cost asymmetry - missing a pedestrian and false-alarming on a shadow have very different consequences.
  2. Mention labeling costs as a practical constraint - sophisticated evaluation requires ground truth, and ground truth requires human time.