Production Evaluation Pipelines: Scale, Cost, and Operating Points
Research metrics like mAP tell you how well a model ranks detections, but production systems need different evaluation approaches. You must measure what actually matters to users and business outcomes at scale.
⚖️ RESEARCH VS PRODUCTION METRICS
| Research metrics | Production metrics |
|---|---|
| mAP across all classes | Critical class recall |
| IoU thresholds | False positive rate per hour |
| Offline benchmark scores | Latency percentiles |
| Fixed test datasets | User-reported issues |
🎯 CHOOSING YOUR OPERATING POINT
The PR curve gives you many possible operating points. Choosing the right one depends on your failure costs:
- High recall operating point: Accept more false positives to catch nearly everything. Use when misses are expensive (safety systems, security alerts).
- High precision operating point: Only surface confident detections. Use when false alarms cause user frustration or wasted downstream processing.
- Balanced operating point: pick the threshold that maximizes F1. Use when both types of errors are roughly equally costly.
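The three strategies above reduce to picking a confidence threshold from the PR-curve arrays. A minimal sketch, assuming you already have parallel `precisions`, `recalls`, and `thresholds` arrays (e.g. from your eval harness); the function name and `target` parameter are hypothetical:

```python
import numpy as np

def pick_operating_point(precisions, recalls, thresholds, mode="f1", target=0.95):
    """Pick a confidence threshold from PR-curve arrays.

    mode="recall":    lowest-cost point with recall >= target (high-recall)
    mode="precision": point with precision >= target (high-precision)
    mode="f1":        threshold maximizing F1 (balanced)
    """
    p, r, t = map(np.asarray, (precisions, recalls, thresholds))
    if mode == "f1":
        f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
        return t[int(np.argmax(f1))]
    ok = np.where((r if mode == "recall" else p) >= target)[0]
    if len(ok) == 0:
        raise ValueError(f"no threshold reaches {mode} >= {target}")
    # Among qualifying thresholds, maximize the other metric.
    other = p if mode == "recall" else r
    return t[ok[int(np.argmax(other[ok]))]]
```

For example, a safety system might call this with `mode="recall", target=0.99`, while a user-facing alerting feature might demand `mode="precision", target=0.95`.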
📊 PRODUCTION EVALUATION PIPELINE
A robust evaluation system runs continuously, not just before deployment:
Ground truth collection: Sample production predictions for human review. Label a representative slice daily or weekly. Track labeler agreement to catch ambiguous cases.
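The two mechanics here, stratified sampling for review and labeler-agreement tracking, can be sketched as follows. The prediction schema (a dict with a `score` field) and function names are assumptions for illustration; Cohen's kappa is one standard agreement measure:

```python
import random
from collections import Counter

def sample_for_review(predictions, per_bucket=50, seed=0):
    """Stratified daily sample: up to `per_bucket` predictions from each
    confidence bucket, so rare high/low-score cases are not drowned out."""
    rng = random.Random(seed)
    buckets = {}
    for p in predictions:
        b = min(int(p["score"] * 10), 9)  # 10 equal-width confidence buckets
        buckets.setdefault(b, []).append(p)
    sample = []
    for items in buckets.values():
        rng.shuffle(items)
        sample.extend(items[:per_bucket])
    return sample

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items;
    a low kappa flags classes or slices with ambiguous labeling guidelines."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Stratifying by confidence (rather than sampling uniformly) ensures the review queue covers the decision boundary, where labels are most informative.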
Slice analysis: Break down metrics by scene type, lighting conditions, object size, and geographic region. A model with 90% overall mAP might have 60% mAP on night scenes - slice analysis reveals these gaps.
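Slice analysis is mechanically simple: group eval records by a metadata key and score each group. A minimal sketch, assuming each record is a dict carrying both metadata (e.g. `"lighting"`) and per-image counts (`"tp"`, `"fn"`); the schema and helper names are hypothetical:

```python
from collections import defaultdict

def per_slice_metric(records, slice_key, metric_fn):
    """Break an eval set into slices by `slice_key` and score each slice."""
    slices = defaultdict(list)
    for r in records:
        slices[r[slice_key]].append(r)
    return {name: metric_fn(items) for name, items in slices.items()}

def recall(items):
    """Recall over a group of records with per-image tp/fn counts."""
    tp = sum(r["tp"] for r in items)
    fn = sum(r["fn"] for r in items)
    return tp / (tp + fn) if tp + fn else 0.0
```

Running `per_slice_metric(records, "lighting", recall)` would surface exactly the day/night gap described above; the same call with `slice_key="object_size"` or `"region"` covers the other axes.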
Regression detection: Compare new model versions against baselines on the same evaluation set. Flag regressions in any critical slice before deployment.
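The gating step above can be sketched as a per-slice comparison. The `tolerance` and `critical` parameters are assumed policy knobs, not a standard API: small drops in non-critical slices may be noise, but critical slices get zero tolerance:

```python
def find_regressions(baseline, candidate, tolerance=0.01, critical=()):
    """Compare a candidate model's per-slice metrics against a baseline.

    baseline, candidate: {slice_name: metric} dicts on the same eval set.
    Flags any drop beyond `tolerance`; slices named in `critical` are
    flagged on any drop at all. Returns (slice, baseline, candidate) tuples.
    """
    regressions = []
    for name, base in baseline.items():
        cand = candidate.get(name, 0.0)  # missing slice counts as total loss
        allowed = 0.0 if name in critical else tolerance
        if base - cand > allowed:
            regressions.append((name, base, cand))
    return regressions
```

A deployment gate would then block promotion whenever this returns a non-empty list, forcing an explicit human override rather than a silent regression.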