Single-Stage Detectors: YOLO, SSD, and Real-Time Performance
Single-stage detectors predict bounding boxes and class probabilities directly from dense feature maps in a single forward pass, eliminating the separate proposal-generation stage. You Only Look Once (YOLO) divides the input image into a grid, typically 13×13 or 19×19 cells, and each cell regresses boxes and predicts classes. Single Shot MultiBox Detector (SSD) and RetinaNet use predefined anchor boxes at multiple scales and aspect ratios across multi-scale feature maps, often built with Feature Pyramid Networks (FPN).
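The core idea can be sketched as a single convolutional prediction head. The snippet below is a minimal, illustrative PyTorch sketch (the class name SingleStageHead and the channel and anchor counts are hypothetical, not from any specific detector): one 1×1 convolution over a dense feature map yields box offsets, objectness, and class scores for every grid cell and anchor in one pass.

```python
# Minimal single-stage detection head sketch (assumed PyTorch; sizes are hypothetical).
import torch
import torch.nn as nn

class SingleStageHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=3, num_classes=80):
        super().__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        # Per anchor: 4 box offsets + 1 objectness score + C class logits
        out_channels = num_anchors * (4 + 1 + num_classes)
        self.pred = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feature_map):
        # feature_map: (batch, in_channels, H, W), e.g. a 13x13 or 19x19 grid
        b, _, h, w = feature_map.shape
        out = self.pred(feature_map)
        # Reshape so each grid cell / anchor pair carries its own prediction vector
        out = out.view(b, self.num_anchors, 4 + 1 + self.num_classes, h, w)
        boxes = out[:, :, 0:4]        # box offsets relative to the cell/anchor
        objectness = out[:, :, 4:5]   # score that an object is present
        class_logits = out[:, :, 5:]  # per-class scores
        return boxes, objectness, class_logits

# A 19x19 grid with 3 anchors per cell yields 19*19*3 = 1083 candidate boxes in one pass
head = SingleStageHead()
feats = torch.randn(1, 256, 19, 19)
boxes, obj, cls = head(feats)
print(boxes.shape, obj.shape, cls.shape)
```

No proposal network is involved: every candidate box comes straight out of this dense prediction, which is what makes the single forward pass possible.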
The performance difference is dramatic. Single-stage models reach 30 to 100+ frames per second on commodity GPUs at input sizes around 512 to 640 pixels, compared with 5 to 10 frames per second for two-stage detectors. A well-optimized single-stage model running in FP16 on a data-center GPU achieves 7 to 12 milliseconds per image, and 20 to 40 milliseconds on edge GPUs. This makes them practical for real-time applications: retail safety systems processing 16 camera streams at 30 frames per second, or mobile augmented-reality overlays.
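To see why those throughput numbers matter, a quick back-of-the-envelope budget helps. The Python sketch below reuses the figures quoted in this section; the helper names and the batching assumption are illustrative, and the arithmetic ignores pre- and post-processing.

```python
# Rough latency budgeting for multi-stream detection (illustrative, not a benchmark).
def per_frame_budget_ms(num_streams, fps):
    """Time available per frame if all inferences run sequentially on one device."""
    return 1000.0 / (num_streams * fps)

def meets_budget(model_latency_ms, num_streams, fps, batch_size=1):
    """Crude check: batching amortizes the per-image cost across a batch."""
    effective_latency = model_latency_ms / batch_size
    return effective_latency <= per_frame_budget_ms(num_streams, fps)

# 16 retail cameras at 30 fps leave ~2.08 ms per frame on a single accelerator...
print(per_frame_budget_ms(16, 30))
# ...so a 10 ms FP16 single-stage model keeps up only with batching (e.g. batch of 8),
# while a 100+ ms two-stage pipeline cannot meet this budget on one device at all.
print(meets_budget(10.0, 16, 30, batch_size=8))
```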
The accuracy gap has narrowed significantly. Early YOLO models traded 5 to 10 mAP points for speed, but modern variants with stronger backbones, FPNs, focal loss to handle class imbalance, and anchor-free designs approach two-stage accuracy. RetinaNet introduced focal loss to down-weight easy negatives, solving the extreme foreground-background imbalance that plagued earlier single-stage models. Anchor-free heads, as in YOLOX and later variants such as YOLOv8, predict center points and scales directly, reducing hyperparameter tuning.
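Focal loss itself is compact. The sketch below is a minimal PyTorch version of the standard formulation, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), with the usual defaults α = 0.25 and γ = 2; it is illustrative rather than a drop-in replica of RetinaNet's implementation.

```python
# Minimal binary focal loss sketch (assumed PyTorch).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw class scores; targets: 0/1 labels of the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the model's probability for the true class
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss for well-classified (easy) examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Confident easy negatives contribute almost nothing, so the vast number of
# background anchors no longer swamps the rare foreground ones.
logits = torch.tensor([5.0, -5.0, 0.1])   # confident pos, confident neg, hard example
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```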
Tesla's automotive perception stack uses single-stage detectors on specialized accelerators to process 6 to 8 cameras at tens of frames per second each, within strict 20 to 33 millisecond perception-cycle budgets. The simplicity of single-stage models also makes them easier to quantize to INT8 and deploy on edge devices, typically losing only 0.5 to 2 mAP points compared with 2 to 4 points for two-stage models.
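The INT8 claim comes down to simple quantization arithmetic. The NumPy sketch below shows symmetric per-tensor weight quantization; it is a conceptual illustration only, with hypothetical tensor shapes, and real deployments use a calibration toolchain such as TensorRT or ONNX Runtime rather than hand-rolled code.

```python
# Conceptual sketch of symmetric per-tensor INT8 quantization (illustrative only).
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(3, 3, 256, 256).astype(np.float32)  # hypothetical conv kernel
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"scale={scale:.5f}, max abs rounding error={err:.5f}")
# A single-stage head is mostly plain convolutions that quantize cleanly like this;
# two-stage models add RoI pooling and per-proposal heads, which are harder to calibrate.
```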
💡 Key Takeaways
• Single-stage models achieve 30 to 100+ frames per second at 512 to 640 pixel input versus 5 to 10 frames per second for two-stage detectors
• Modern single-stage detectors reach 7 to 12 milliseconds on data-center GPUs in FP16 and 20 to 40 milliseconds on edge GPUs, meeting real-time requirements
• Focal loss, introduced by RetinaNet, addresses extreme class imbalance by down-weighting easy negatives, closing the accuracy gap with two-stage models
• Tesla uses single-stage detectors on accelerators for 6 to 8 camera automotive perception within 20 to 33 millisecond cycle budgets
• Single-stage models quantize to INT8 more easily, losing 0.5 to 2 mAP points versus 2 to 4 points for two-stage models, which is critical for edge deployment
📌 Examples
Retail safety edge box processing 16 camera streams at 30 frames per second using YOLO in FP16, finishing detection in 15 to 20 milliseconds per frame
Mobile augmented-reality app running a single-stage detector at 30 frames per second on a phone Neural Processing Unit (NPU) with 20 to 30 millisecond latency
Amazon warehouse robot navigation using SSD with a Feature Pyramid Network for real-time obstacle detection at 60 frames per second