
What is Object Detection and How Does It Differ From Classification?

Definition
Object detection locates and identifies multiple objects in an image by predicting both bounding boxes (where the objects are) and class labels (what they are). Unlike classification, which outputs one label per image, detection outputs multiple boxes, each with a label and a confidence score.

Classification vs Detection

Classification: Input is an image, output is a single label. Is this a cat or a dog? The model assumes one primary object fills the frame.

Detection: Input is an image, output is a list of (box, label, confidence) tuples. Where are all the cars, pedestrians, and traffic signs? Each object gets its own bounding box, class prediction, and confidence score.
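
The difference in output shape can be made concrete with a small sketch. The `Detection` type, the corner-based box format, and the `filter_detections` helper below are illustrative assumptions, not part of any particular library.

```python
from typing import List, NamedTuple, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


class Detection(NamedTuple):
    """One detected object: where it is, what it is, and how sure the model is."""
    box: Box
    label: str
    confidence: float


def filter_detections(detections: List[Detection],
                      min_confidence: float = 0.5) -> List[Detection]:
    """Keep detections at or above a confidence threshold, highest first."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# A classifier returns a single label for the whole image.
classification_output: str = "dog"

# A detector returns a variable-length list of (box, label, confidence) tuples.
detection_output = [
    Detection((12.0, 40.0, 180.0, 200.0), "car", 0.92),
    Detection((210.0, 60.0, 250.0, 190.0), "pedestrian", 0.31),
    Detection((300.0, 20.0, 330.0, 50.0), "traffic sign", 0.67),
]

for det in filter_detections(detection_output, min_confidence=0.5):
    print(det.label, det.confidence)
```

The confidence threshold is a deployment choice: raising it trades missed objects for fewer false alarms.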

The Detection Pipeline

Every detector must solve two problems: localization (where is the object?) and classification (what is it?). The core challenge is handling an unknown number of objects at unknown locations without exhaustively checking every possible box.

Anchor boxes: Many detectors pre-define a grid of reference boxes at multiple scales and aspect ratios. The model predicts adjustments to these anchors rather than raw coordinates. A 416x416 image can have 10,000+ anchor boxes, each a potential detection.
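
To see where a figure like 10,000+ can come from, here is a minimal sketch that counts anchors and applies a predicted adjustment to one of them. The layout (three prediction scales with strides 32, 16, and 8, and 3 anchors per grid cell) is an assumption modeled on a YOLOv3-style detector. The offset parameterization follows the R-CNN family and is only one of several in use. The function names are hypothetical.

```python
import math


def count_anchors(image_size, strides, anchors_per_cell):
    """Count anchors over square feature grids, one grid per stride."""
    total = 0
    for stride in strides:
        cells_per_side = image_size // stride
        total += cells_per_side * cells_per_side * anchors_per_cell
    return total


def decode_anchor(anchor, offsets):
    """Turn predicted offsets into a box.

    anchor:  (center_x, center_y, width, height) of the reference box
    offsets: (tx, ty, tw, th) predicted by the model
    The center shifts by a fraction of the anchor size, and the size is
    scaled exponentially (an assumed, R-CNN-style parameterization).
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


# Grids of 13x13, 26x26, and 52x52 cells with 3 anchors each.
print(count_anchors(416, strides=(32, 16, 8), anchors_per_cell=3))  # 10647

# Zero offsets return the anchor unchanged; nonzero offsets nudge it.
print(decode_anchor((100.0, 100.0, 50.0, 80.0), (0.1, -0.2, 0.0, 0.0)))
```

Predicting small offsets relative to a nearby anchor tends to be an easier regression target than predicting absolute coordinates from scratch.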

Key Metrics

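IoU (Intersection over Union): Measures how well a predicted box overlaps with the ground truth, computed as the area of their intersection divided by the area of their union. An IoU of 0.5 means the shared area is half of the combined area; it is typically the minimum for a detection to count as correct.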
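
A minimal sketch of the IoU computation, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max). The function name and box format are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two equal boxes that each share half their area: IoU is 1/3, not 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))

# Two equal boxes must share two thirds of their area to reach IoU 0.5.
print(iou((0, 0, 30, 10), (10, 0, 40, 10)))  # 0.5
```

The first example shows why an IoU of 0.5 is a stricter bar than it sounds: the union grows as the boxes drift apart, so IoU falls faster than the raw overlap fraction.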

mAP (mean Average Precision): Summarizes precision and recall across all classes and confidence thresholds. mAP@0.5 uses a 50% IoU threshold; mAP@[0.5:0.95] averages over IoU thresholds from 0.5 to 0.95 (in steps of 0.05 in the COCO benchmark), which rewards tighter boxes.
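
The following sketch shows one common way to compute average precision for a single class at a single IoU threshold, then average it over classes. It assumes each detection has already been matched against the ground truth and flagged as a true or false positive; that matching step is omitted. The area under the precision-recall curve is taken with all-point interpolation, and benchmarks differ in the exact interpolation they use. Function names are hypothetical.

```python
def average_precision(scored_matches, num_ground_truth):
    """AP for one class at one IoU threshold.

    scored_matches:   list of (confidence, is_true_positive) per detection
    num_ground_truth: number of ground-truth objects of this class
    """
    if num_ground_truth == 0:
        return 0.0

    # Walk detections from most to least confident, tracking the curve.
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    precisions, recalls = [], []
    true_pos = false_pos = 0
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)

    # Precision envelope: each point takes the best precision at any
    # equal or higher recall, which removes the zigzag in the raw curve.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the envelope, summed over recall increments.
    ap = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Three detections (hit, miss, hit) against two ground-truth objects.
car_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
print(car_ap)
print(mean_average_precision({"car": car_ap, "pedestrian": 0.5}))
```

Repeating this at several IoU thresholds and averaging the results gives a range-style metric such as mAP@[0.5:0.95].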

💡 Key Takeaways
Detection outputs (box, label, confidence) tuples for multiple objects; classification outputs one label per image
Anchor boxes pre-define candidate locations - a 416x416 input can have 10,000+ anchors to evaluate
IoU measures box overlap quality; 0.5 is the usual minimum for a correct detection, 0.75+ indicates precise localization
mAP summarizes detection quality across classes and thresholds - the standard benchmark metric
📌 Interview Tips
1. Interview Tip: Clarify first whether the task needs detection or classification - the architectures differ, and compute requirements can differ by roughly 10x
2. Interview Tip: State the IoU threshold when discussing accuracy - mAP@0.5 and mAP at a stricter threshold such as 0.75 can differ by 20+ points