Choosing Metrics and Protocols for Your Task
There is no universally correct evaluation setup. The right metrics and protocols depend on your task requirements, deployment constraints, and what failure modes matter most for your users.
🎯 MATCHING METRICS TO TASK REQUIREMENTS
Detection vs counting: If you need accurate object counts (crowd estimation, inventory), optimize for recall and monitor count error (predicted minus true counts per image). Missed objects hurt more than imprecise boxes.
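As a sketch of what "monitor count error" can mean in practice, the helper below (hypothetical names, not from any standard library) computes per-image mean absolute count error plus a signed bias that distinguishes systematic over- from undercounting:

```python
import numpy as np

def count_metrics(pred_counts, true_counts):
    """Per-image count error metrics for counting-oriented evaluation.

    pred_counts / true_counts: sequences of per-image object counts.
    """
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.abs(pred - true).mean()   # mean absolute count error
    bias = (pred - true).mean()        # signed bias: > 0 means overcounting
    return {"count_mae": mae, "count_bias": bias}

print(count_metrics(pred_counts=[9, 12, 5], true_counts=[10, 12, 4]))
```

Reporting both numbers matters: a model that overcounts on half the images and undercounts on the other half can show near-zero bias while its absolute error is large.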
Detection vs tracking: If objects persist across frames (video surveillance, autonomous driving), add tracking metrics such as MOTA (multiple object tracking accuracy) and ID switches. A detector with perfect per-frame mAP can still produce jittery, identity-swapping tracks.
Detection vs segmentation: If you need precise boundaries (medical imaging, satellite analysis), IoU at the pixel level matters more than bounding box IoU. Report mask AP separately.
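Pixel-level IoU is simple to compute directly from binary masks; a minimal sketch with NumPy:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Pixel-level IoU between two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0  # two empty masks: define IoU = 1

a = [[1, 1], [0, 0]]
b = [[1, 0], [1, 0]]
print(mask_iou(a, b))  # intersection 1 pixel, union 3 pixels
```

Mask IoU penalizes ragged boundaries and holes that box IoU never sees, which is exactly why mask AP should be reported separately from box AP when boundaries matter.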
📋 DESIGNING YOUR EVALUATION PROTOCOL
Dataset construction: Sample from your actual production distribution. Include hard cases in their real proportions; oversampling edge cases will make your metrics look artificially pessimistic relative to production. Document inclusion criteria so future evaluations are comparable.
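One way to keep proportions honest is stratified sampling from the production pool. The sketch below assumes a hypothetical `stratum_of` function that tags each image with a stratum (scene type, camera, time of day); per-stratum rounding means the total can be off by an item or two:

```python
import random
from collections import defaultdict

def proportional_sample(items, stratum_of, n, seed=0):
    """Draw ~n items, preserving each stratum's share of the full pool.

    stratum_of: hypothetical callable mapping an item to its stratum key.
    Fixed seed keeps the evaluation set reproducible across runs.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[stratum_of(item)].append(item)
    sample = []
    for stratum, members in sorted(by_stratum.items()):
        k = round(n * len(members) / len(items))  # proportional allocation
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# 100 images, two equally sized strata -> 5 sampled from each.
images = list(range(100))
subset = proportional_sample(images, stratum_of=lambda i: i % 2, n=10)
```

Log the seed and the stratum definitions alongside the sampled IDs; that is the "document inclusion criteria" step made mechanical.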
Labeling guidelines: Write explicit rules for ambiguous cases. Is a partially visible object labeled? What about reflections? How much occlusion before an object is not labeled? Ambiguity creates label noise.
Reproducibility: Fix random seeds, document preprocessing, and version your evaluation code alongside your model code. A metric improvement means nothing if you cannot reproduce it.
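A small sketch of both habits: pinning the common Python RNGs, and fingerprinting the evaluation config plus code version so a reported number can be traced back to exactly what produced it (the function names here are illustrative, not from any framework; if you use PyTorch, seed it as well):

```python
import hashlib
import json
import os
import random

import numpy as np

def fix_seeds(seed=42):
    """Pin the common RNGs so repeated evaluation runs match."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If using PyTorch (not imported here): torch.manual_seed(seed)

def eval_fingerprint(config, code_version):
    """Stable hash of the eval config + code version, for result logs."""
    payload = json.dumps({"config": config, "code": code_version},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

fix_seeds(42)
tag = eval_fingerprint({"iou_thresh": 0.5, "max_dets": 100}, "a1b2c3d")
```

Attaching the fingerprint to every reported metric makes "we cannot reproduce it" detectable immediately: if the fingerprints differ, the runs were not comparable in the first place.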
⚡ PRACTICAL PROTOCOL DECISIONS
- Evaluation set size: As a rule of thumb, at least 1,000 images for stable mAP estimates. Use more for rare classes or when comparing models with small performance differences.
- Confidence intervals: Report standard deviation across multiple runs or bootstrap samples. A 0.5% mAP difference that sits inside the interval is noise, not signal.
- Latency measurement: Warm up the model, exclude first batch, measure on representative hardware. GPU cold start can add 100ms+ that disappears in steady state.
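The bootstrap mentioned above can be sketched in a few lines. This resamples images (not individual detections), which respects image-level correlation; per-image scores are assumed to be precomputed:

```python
import numpy as np

def bootstrap_ci(per_image_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean per-image metric (e.g. AP)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

scores = np.random.default_rng(1).uniform(0.4, 0.8, size=1000)
mean, (lo, hi) = bootstrap_ci(scores)
```

When comparing two models, compare their intervals (or bootstrap the paired difference): if the intervals overlap heavily, the observed gap is not evidence of a real improvement.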
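A minimal latency harness along the lines described: warm up, then report the median over timed runs (median is robust to stragglers). The `infer` callable is a stand-in for your model; on GPU it must synchronize internally (e.g. `torch.cuda.synchronize()`) before returning, or the timings measure only kernel launch:

```python
import time

def measure_latency(infer, batch, warmup=10, iters=100):
    """Median per-batch latency in seconds, excluding cold-start effects.

    infer: hypothetical callable running one batch end to end.
    """
    for _ in range(warmup):   # warm caches, JIT compilation, allocators
        infer(batch)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer(batch)
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]  # median

latency_s = measure_latency(lambda b: sum(b), batch=list(range(1000)))
```

Run this on the hardware you will deploy on; a median measured on a development GPU says little about an edge device, and tail latencies (p95/p99) are worth reporting alongside the median for latency-sensitive systems.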