
What is Object Detection and How Does It Differ From Classification?

Object detection combines two tasks: assigning class labels to objects and localizing them with bounding boxes in an image or video frame. Unlike image classification, which outputs a single label for the entire image, detection predicts multiple boxes, each with a class probability vector and spatial coordinates. For example, classification might label an image as "street scene", while detection identifies 5 cars, 3 pedestrians, and 2 traffic lights, with precise box coordinates for each.

Modern detectors parameterize box coordinates either as offsets from predefined anchors or as direct predictions of center point, width, and height. The model outputs thousands of candidate boxes, and non-maximum suppression (NMS) removes duplicates by comparing Intersection over Union (IoU): a box with 0.7 IoU overlap against a higher-confidence box of the same class gets suppressed.

The primary quality metric is mean Average Precision (mAP), computed at one or more IoU thresholds. Common thresholds include 0.5 and 0.75, and the COCO metric averages over thresholds from 0.5 to 0.95 in 0.05 steps. A detector scoring 42.5 mAP on COCO at IoU 0.5 to 0.95 is considered strong. For production systems, end-to-end latency matters just as much: real-time applications need 15 to 33 milliseconds per frame (30 to 60 frames per second), while offline batch jobs can tolerate 200 to 400 milliseconds per image.
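The greedy NMS procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation (real systems use vectorized versions such as `torchvision.ops.nms`); the function names here are chosen for clarity:

```python
def iou(a, b):
    # Boxes in corner format (x1, y1, x2, y2); returns Intersection over Union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    # Greedy per-class NMS: repeatedly keep the highest-scoring box and
    # drop any remaining box whose IoU with it exceeds the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

In practice NMS is run per class, so a car box never suppresses an overlapping pedestrian box.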
💡 Key Takeaways
Detection outputs multiple bounding boxes with class labels and confidence scores, while classification produces a single image level label
Box coordinates are typically parameterized as offsets from anchors or direct center, width, height predictions from grid cells
Non-maximum suppression removes duplicate detections by suppressing boxes with high IoU overlap, commonly using a 0.5 to 0.7 IoU threshold
Mean Average Precision at multiple IoU thresholds is the standard quality metric, with COCO mAP averaging from 0.5 to 0.95 IoU
Real-time systems require 15 to 33 milliseconds of per-frame latency to sustain 30 to 60 frames per second, while offline batch jobs tolerate 200 to 400 milliseconds
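The anchor-offset parameterization mentioned in the takeaways can be sketched as follows. The transform shown is the standard R-CNN-style box encoding (center shifts proportional to anchor size, exponential width/height scaling); the helper name `decode_box` is illustrative, not a library API:

```python
import math

def decode_box(anchor, deltas):
    # anchor: (cx, cy, w, h); deltas: predicted (tx, ty, tw, th).
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    # Center offsets are scaled by the anchor's width and height.
    cx = ax + tx * aw
    cy = ay + ty * ah
    # Width and height scale exponentially, keeping them positive.
    w = aw * math.exp(tw)
    h = ah * math.exp(th)
    # Convert to corner format (x1, y1, x2, y2) for IoU and NMS.
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Zero deltas recover the anchor itself, so the network only has to learn small corrections relative to each anchor.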
📌 Examples
Retail safety camera system detecting people, forklifts, and spills in a warehouse, with box coordinates for each object
Autonomous vehicle perception identifying 12 cars, 4 pedestrians, 3 cyclists with precise localization for path planning
Photo platform indexing 100 million uploaded images per day, extracting objects like faces, products, landmarks with bounding boxes for search