What is Object Detection and How Does It Differ From Classification?
Classification vs Detection
Classification: Input is an image, output is a single label. Is this a cat or a dog? The model assumes one primary object fills the frame.
Detection: Input is an image, output is a list of (box, label, confidence) tuples. Where are all the cars, pedestrians, and traffic signs? Each object gets its own bounding box, class prediction, and confidence score.
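The difference in output shape can be made concrete with a small sketch. The `Detection` class, the sample values, and the `filter_by_confidence` helper below are illustrative assumptions, not the output format of any particular library:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container for illustration)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    label: str
    confidence: float


# Classification: a single label for the whole image.
classification_output: str = "cat"

# Detection: a variable-length list, one entry per object found.
# The values are made up for the example.
detection_output: List[Detection] = [
    Detection((34.0, 50.0, 210.0, 180.0), "car", 0.92),
    Detection((250.0, 60.0, 290.0, 170.0), "pedestrian", 0.81),
    Detection((300.0, 20.0, 330.0, 50.0), "traffic sign", 0.34),
]


def filter_by_confidence(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.confidence >= threshold]


confident = filter_by_confidence(detection_output, 0.5)
print([d.label for d in confident])  # ['car', 'pedestrian']
```

Because the list length varies from image to image, downstream code usually filters by confidence before using the results.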
The Detection Pipeline
Every detector must solve two problems: localization (where is the object?) and classification (what is it?). The core challenge is handling an unknown number of objects at unknown locations without exhaustively checking every possible box.
Anchor boxes: Many detectors pre-define a grid of reference boxes at multiple scales and aspect ratios. The model predicts adjustments (offsets) to these anchors rather than raw coordinates. A 416x416 image might have 10,000+ anchor boxes across all scales, each a potential detection.
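To see where a figure like 10,000+ can come from, here is a minimal sketch of anchor generation. The strides (8, 16, 32), the three aspect ratios, and the `scale` factor are assumptions chosen for illustration; real detectors pick their own values. The `decode` function uses the center-offset and log-size parameterization common in anchor-based detectors, which is one of several possible choices:

```python
import math
from itertools import product


def generate_anchors(image_size=416, strides=(8, 16, 32),
                     aspect_ratios=(0.5, 1.0, 2.0), scale=4.0):
    """Return anchors as (center_x, center_y, width, height) tuples.

    All parameter values are illustrative assumptions.
    """
    anchors = []
    for stride in strides:
        cells = image_size // stride      # grid is cells x cells at this scale
        base = scale * stride             # base anchor size grows with stride
        for row, col in product(range(cells), repeat=2):
            cx = (col + 0.5) * stride     # anchor is centered on the grid cell
            cy = (row + 0.5) * stride
            for ratio in aspect_ratios:   # ratio is width / height
                w = base * math.sqrt(ratio)
                h = base / math.sqrt(ratio)
                anchors.append((cx, cy, w, h))
    return anchors


def decode(anchor, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw,        # shift center by a fraction of anchor width
            ay + dy * ah,        # shift center by a fraction of anchor height
            aw * math.exp(dw),   # scale width
            ah * math.exp(dh))   # scale height


anchors = generate_anchors()
print(len(anchors))  # 10647 = (52*52 + 26*26 + 13*13) * 3
```

With all offsets at zero, `decode` returns the anchor unchanged, so the network only has to learn small corrections around each reference box.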
Key Metrics
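IoU (Intersection over Union): Measures how well a predicted box overlaps the ground truth, computed as the area of their intersection divided by the area of their union. An IoU of 0.5 means the shared area is half of the combined area, and it is the threshold most commonly used to count a detection as correct.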
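IoU is simple to compute for axis-aligned boxes. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max); the function name and box format are choices made for this example:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 10.0, 10.0)
ground_truth = (5.0, 0.0, 15.0, 10.0)
print(iou(predicted, ground_truth))  # 1/3, about 0.333
```

In this example each box has half of its area inside the other, yet the IoU is only one third, because the union (150) is larger than either box (100). IoU is a stricter measure than "fraction of the box covered".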
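mAP (mean Average Precision): For each class, Average Precision (AP) is the area under the precision-recall curve traced out as the confidence threshold is lowered; mAP is the mean of AP over all classes. mAP@0.5 counts a detection as correct at an IoU threshold of 0.5; mAP@[0.5:0.95] averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which rewards tighter boxes.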
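The following sketch shows one common way to compute AP for a single class and then average over classes. It assumes each prediction has already been marked as a true or false positive by matching it to ground truth at a chosen IoU threshold (such as 0.5). The function names and input format are hypothetical, and benchmarks differ in interpolation details:

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per prediction.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Walk predictions from most to least confident, tracking precision/recall.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name to (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Made-up example: two ground-truth cars, three predictions.
car_hits = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(car_hits, 2))  # 5/6, about 0.833
```

Repeating this at each IoU threshold from 0.5 to 0.95 and averaging the results gives the stricter mAP@[0.5:0.95] figure.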