Single-Stage Detectors: YOLO, SSD, and Real-Time Performance
Single-Stage Detection Approach
Single-stage detectors predict bounding boxes and class labels in a single pass through the network, with no separate region proposal step. The model outputs final detections directly from a dense grid of anchor boxes. This unified approach trades some accuracy for significant speed gains.
How Single-Stage Detection Works
Divide the image into a grid (e.g., 13x13 cells). Each cell predicts B bounding boxes (typically 3-9 per cell). For each box, the network predicts a center offset from the cell, a width, a height, an objectness score, and class probabilities. All predictions happen simultaneously in one forward pass.
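The per-box decoding step above can be sketched as follows. This uses the YOLOv2/v3-style parameterization (sigmoid offsets keep the center inside its cell; anchor dimensions are scaled exponentially); the function name and argument layout are illustrative, not a specific library's API:

```python
import math

def decode_box(tx, ty, tw, th, cell_x, cell_y, grid_size, anchor_w, anchor_h):
    """Decode one raw prediction into a normalized (cx, cy, w, h) box.

    tx, ty: raw center offsets, squashed to (0, 1) so the center
            stays inside the predicting cell.
    tw, th: raw log-scale sizes applied to the anchor dimensions
            (anchor_w, anchor_h given as fractions of image size).
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    cx = (cell_x + sigmoid(tx)) / grid_size  # box center x in [0, 1]
    cy = (cell_y + sigmoid(ty)) / grid_size  # box center y in [0, 1]
    w = anchor_w * math.exp(tw)              # box width
    h = anchor_h * math.exp(th)              # box height
    return cx, cy, w, h
```

With all raw values at zero, a box in the center cell of a 13x13 grid decodes to the image center at exactly the anchor's size, which makes the parameterization easy to sanity-check.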
Dense prediction: A 416x416 input with three detection scales produces 10,000+ predictions, most of which are background. Non-maximum suppression (NMS) filters out low-confidence predictions and overlapping detections to produce the final output.
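A minimal greedy NMS over those dense predictions might look like this; the corner box format `(x1, y1, x2, y2)` and the threshold values are assumptions for the sketch:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Greedy NMS: drop low-confidence boxes, then suppress any box
    that overlaps a higher-scoring kept box. Returns kept indices."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept
```

In practice this runs per class, so overlapping boxes of different classes can coexist; libraries such as torchvision ship batched implementations of the same greedy rule.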
YOLO Architecture Principles
YOLO (You Only Look Once) processes the entire image globally rather than examining regions sequentially. The network sees full context when making predictions. This helps with objects that span multiple regions and reduces false positives from partial views.
Multi-scale detection: Modern YOLO versions predict at multiple feature-map resolutions: small objects are detected on high-resolution maps, large objects on low-resolution maps. This addresses early YOLO versions' weakness on small-object detection.
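Under a common YOLOv3-style layout (strides 32, 16, and 8 with three anchors per cell, assumed here), the dense prediction count mentioned earlier falls out of a one-line sum:

```python
def num_predictions(img_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Count dense predictions across the detection scales.

    Stride 32 gives the coarse map for large objects; stride 8 gives
    the fine map for small objects. Each grid cell on each map emits
    anchors_per_cell boxes.
    """
    return sum(anchors_per_cell * (img_size // s) ** 2 for s in strides)
```

For a 416x416 input this gives 3 x (13^2 + 26^2 + 52^2) = 10,647 boxes, consistent with the "10,000+ predictions" figure above.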
Performance Characteristics
Speed: 5-30ms per image on modern GPUs. Real-time detection at 30-60+ FPS is achievable.
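The latency-to-throughput conversion is simple arithmetic; this hypothetical helper makes the relationship explicit:

```python
def fps(latency_ms):
    """Convert single-image inference latency (ms) to frames per second,
    assuming one image per forward pass with no batching or pipelining."""
    return 1000.0 / latency_ms
```

At the quoted range, 30 ms per image is about 33 FPS and 5 ms per image is 200 FPS, which is where the 30-60+ FPS real-time claim comes from.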
Accuracy: 2-5% lower mAP than two-stage detectors on challenging benchmarks. The gap narrows with newer architectures and larger backbones.