Production Evaluation Pipelines: Scale, Cost, and Operating Points
Research metrics like mAP tell you how well a model ranks detections, but production systems need different evaluation approaches. You must measure what actually matters to users and business outcomes at scale.
⚖️ RESEARCH VS PRODUCTION METRICS
| Research metrics | Production metrics |
|---|---|
| mAP across all classes | Critical class recall |
| IoU thresholds | False positive rate per hour |
| Offline benchmark scores | Latency percentiles |
| Fixed test datasets | User-reported issues |
🎯 CHOOSING YOUR OPERATING POINT
The PR curve gives you many possible operating points. Choosing the right one depends on your failure costs:
- High recall operating point: Accept more false positives to catch nearly everything. Use when misses are expensive (safety systems, security alerts).
- High precision operating point: Only surface confident detections. Use when false alarms cause user frustration or wasted downstream processing.
- Balanced operating point: pick the threshold that maximizes F1. Use when both types of errors are roughly equally costly.
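The three strategies above reduce to picking a confidence threshold from the PR-curve arrays. A minimal sketch, assuming you already have parallel `precisions`, `recalls`, and `thresholds` arrays (e.g. from your eval harness); the function name and `target` parameter are hypothetical:

```python
import numpy as np

def pick_operating_point(precisions, recalls, thresholds, mode="f1", target=0.95):
    """Pick a confidence threshold from PR-curve arrays.

    mode="recall":    lowest-cost point with recall >= target (high-recall)
    mode="precision": point with precision >= target (high-precision)
    mode="f1":        threshold maximizing F1 (balanced)
    """
    p, r, t = map(np.asarray, (precisions, recalls, thresholds))
    if mode == "f1":
        f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
        return t[int(np.argmax(f1))]
    ok = np.where((r if mode == "recall" else p) >= target)[0]
    if len(ok) == 0:
        raise ValueError(f"no threshold reaches {mode} >= {target}")
    # Among qualifying thresholds, maximize the other metric.
    other = p if mode == "recall" else r
    return t[ok[int(np.argmax(other[ok]))]]
```

For example, a safety system might call this with `mode="recall", target=0.99`, while a user-facing alerting feature might demand `mode="precision", target=0.95`.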
📊 PRODUCTION EVALUATION PIPELINE
A robust evaluation system runs continuously, not just before deployment:
Ground truth collection: Sample production predictions for human review. Label a representative slice daily or weekly. Track labeler agreement to catch ambiguous cases.
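The two mechanics here, stratified sampling for review and labeler-agreement tracking, can be sketched as follows. The prediction schema (a dict with a `score` field) and function names are assumptions for illustration; Cohen's kappa is one standard agreement measure:

```python
import random
from collections import Counter

def sample_for_review(predictions, per_bucket=50, seed=0):
    """Stratified daily sample: up to `per_bucket` predictions from each
    confidence bucket, so rare high/low-score cases are not drowned out."""
    rng = random.Random(seed)
    buckets = {}
    for p in predictions:
        b = min(int(p["score"] * 10), 9)  # 10 equal-width confidence buckets
        buckets.setdefault(b, []).append(p)
    sample = []
    for items in buckets.values():
        rng.shuffle(items)
        sample.extend(items[:per_bucket])
    return sample

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items;
    a low kappa flags classes or slices with ambiguous labeling guidelines."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Stratifying by confidence (rather than sampling uniformly) ensures the review queue covers the decision boundary, where labels are most informative.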
Slice analysis: Break down metrics by scene type, lighting conditions, object size, and geographic region. A model with 90% overall mAP might have 60% mAP on night scenes - slice analysis reveals these gaps.
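Slice analysis is mechanically simple: group eval records by a metadata key and score each group. A minimal sketch, assuming each record is a dict carrying both metadata (e.g. `"lighting"`) and per-image counts (`"tp"`, `"fn"`); the schema and helper names are hypothetical:

```python
from collections import defaultdict

def per_slice_metric(records, slice_key, metric_fn):
    """Break an eval set into slices by `slice_key` and score each slice."""
    slices = defaultdict(list)
    for r in records:
        slices[r[slice_key]].append(r)
    return {name: metric_fn(items) for name, items in slices.items()}

def recall(items):
    """Recall over a group of records with per-image tp/fn counts."""
    tp = sum(r["tp"] for r in items)
    fn = sum(r["fn"] for r in items)
    return tp / (tp + fn) if tp + fn else 0.0
```

Running `per_slice_metric(records, "lighting", recall)` would surface exactly the day/night gap described above; the same call with `slice_key="object_size"` or `"region"` covers the other axes.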
Regression detection: Compare new model versions against baselines on the same evaluation set. Flag regressions in any critical slice before deployment.
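The gating step above can be sketched as a per-slice comparison. The `tolerance` and `critical` parameters are assumed policy knobs, not a standard API: small drops in non-critical slices may be noise, but critical slices get zero tolerance:

```python
def find_regressions(baseline, candidate, tolerance=0.01, critical=()):
    """Compare a candidate model's per-slice metrics against a baseline.

    baseline, candidate: {slice_name: metric} dicts on the same eval set.
    Flags any drop beyond `tolerance`; slices named in `critical` are
    flagged on any drop at all. Returns (slice, baseline, candidate) tuples.
    """
    regressions = []
    for name, base in baseline.items():
        cand = candidate.get(name, 0.0)  # missing slice counts as total loss
        allowed = 0.0 if name in critical else tolerance
        if base - cand > allowed:
            regressions.append((name, base, cand))
    return regressions
```

A deployment gate would then block promotion whenever this returns a non-empty list, forcing an explicit human override rather than a silent regression.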