Computer Vision Systems • Object Detection (R-CNN, YOLO, Single-stage vs Two-stage) • Medium • ⏱️ ~2 min
Two-Stage Detectors: R-CNN Family Evolution and Performance
Two-stage detectors separate the problem into region proposal generation followed by classification and box refinement. The Region-based Convolutional Neural Network (R-CNN) family evolved through three major iterations, each addressing a critical bottleneck. The original R-CNN used selective search to generate around 2,000 region proposals per image, then extracted CNN features independently for each region. Inference took approximately 47 seconds per image, making it impractical for production.
Fast R-CNN improved on this by sharing a single backbone CNN across the entire image and using Region of Interest (ROI) pooling to extract features for each proposal from the shared feature map. Training became end-to-end, but selective search remained the bottleneck. Faster R-CNN introduced the Region Proposal Network (RPN), which learns to generate proposals directly from CNN features using anchor boxes at multiple scales and aspect ratios. The RPN typically passes around 300 proposals per image to the detector head, which classifies and refines them.
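To make the shared-feature design concrete, here is a minimal sketch using torchvision's `roi_align` (the alignment-corrected successor to ROI pooling used in modern implementations). The ResNet-50 backbone, the stride-32 feature grid, and the hard-coded proposal boxes are illustrative assumptions; in Faster R-CNN the proposals would come from the RPN.

```python
# Minimal sketch of the Fast/Faster R-CNN feature-sharing idea:
# run the backbone once per image, then pool a fixed-size feature
# patch for each proposal from the shared feature map.
import torch
import torchvision
from torchvision.ops import roi_align

# Shared backbone: one forward pass over the whole image.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc

image = torch.randn(1, 3, 800, 800)        # stand-in for one 800x800 image
feature_map = feature_extractor(image)     # (1, 2048, 25, 25): stride-32 features

# Proposals in image coordinates (x1, y1, x2, y2). In Faster R-CNN these
# come from the RPN; here they are hypothetical hard-coded boxes.
proposals = torch.tensor([
    [ 50.0,  60.0, 300.0, 400.0],
    [400.0, 100.0, 750.0, 500.0],
])

# ROI Align crops a fixed 7x7 feature patch per proposal from the shared
# map; spatial_scale maps image coordinates onto the stride-32 grid.
roi_features = roi_align(
    feature_map,
    [proposals],               # one box tensor per image in the batch
    output_size=(7, 7),
    spatial_scale=1.0 / 32.0,
)
print(roi_features.shape)      # torch.Size([2, 2048, 7, 7])
```

Each proposal now maps to a fixed-size feature tensor, so the detector head can classify and refine every region without rerunning the backbone, which is exactly the speedup Fast R-CNN introduced.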
Faster R-CNN achieves 5 to 10 frames per second on a high-end data-center Graphics Processing Unit (GPU) with the input image's shorter side at 600 to 800 pixels. The two-stage design delivers strong accuracy, particularly on small objects, crowded scenes, and long-tail classes. Meta's content understanding systems have used Faster R-CNN and RetinaNet variants through Detectron2 for offline media processing, where 200 to 400 milliseconds of latency per image is acceptable in exchange for higher mean Average Precision (mAP).
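As a concrete reference point, the sketch below runs an off-the-shelf pretrained Faster R-CNN through torchvision rather than Detectron2, purely for brevity; it is not Meta's production pipeline, and the random tensor stands in for a real decoded image.

```python
# Minimal sketch: two-stage detection inference with torchvision's
# pretrained Faster R-CNN (ResNet-50 + FPN backbone).
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()

# torchvision detection models take a list of CHW float tensors in [0, 1]
# and resize internally so the shorter side lands around 800 pixels.
image = torch.rand(3, 600, 800)   # stand-in for a real decoded image

with torch.no_grad():
    outputs = model([image])

# One dict per input image: boxes (x1, y1, x2, y2), class labels, scores.
det = outputs[0]
keep = det["scores"] > 0.5        # simple confidence threshold
print(det["boxes"][keep], det["labels"][keep])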
The tradeoff is complexity and latency. Two-stage pipelines require aligned feature maps, proposal sampling logic, and ROI operations that are harder to optimize for edge devices. Quantization to INT8 can lose more accuracy than with single-stage models because ROI pooling and multi-scale feature handling introduce numerical sensitivity.
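For illustration, here is a hedged sketch of post-training INT8 quantization using PyTorch's FX graph mode, applied only to a bare ResNet-50 backbone; the backbone choice, calibration on random tensors, and the fbgemm backend are all assumptions. Quantizing the full two-stage pipeline is exactly where the ROI and multi-scale handling described above cause trouble.

```python
# Minimal sketch of post-training static INT8 quantization of a backbone
# with PyTorch FX graph mode. The calibration data here is random and
# purely illustrative; real calibration uses representative images.
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

backbone = torchvision.models.resnet50(weights=None).eval()
example_input = (torch.randn(1, 3, 800, 800),)

# Insert observers, run calibration batches, then convert to INT8.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")  # x86 server backend
prepared = prepare_fx(backbone, qconfig_mapping, example_input)
with torch.no_grad():
    for _ in range(8):                                   # stand-in calibration set
        prepared(torch.randn(1, 3, 800, 800))
quantized = convert_fx(prepared)

print(quantized(torch.randn(1, 3, 800, 800)).shape)      # torch.Size([1, 1000])
```

A plain convolutional backbone quantizes cleanly this way; the accuracy loss the paragraph above describes tends to come from the ROI and multi-scale feature stages that this sketch deliberately leaves out.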
💡 Key Takeaways
• Original R-CNN took 47 seconds per image with selective search and independent CNN passes per region, completely impractical for production
• Faster R-CNN with its Region Proposal Network achieves 5 to 10 frames per second on high-end GPUs at 600 to 800 pixel input with roughly 300 proposals per image
• The two-stage design excels at small objects, crowded scenes, and long-tail classes due to focused ROI processing and refined feature extraction
• Meta uses Faster R-CNN and RetinaNet variants in Detectron2 for offline content understanding where 200 to 400 milliseconds of latency is acceptable
• Deployment complexity is higher than for single-stage models: ROI operations and multi-scale features are harder to quantize and optimize for edge devices
📌 Examples
Medical imaging review using Faster R-CNN to detect small lesions and nodules with high recall requirements, tolerating 500 milliseconds of latency
Meta Detectron2 processing uploaded photos offline for content moderation and search indexing with two-stage models for maximum mAP
Retail inventory system scanning crowded shelves with Faster R-CNN to distinguish overlapping products, achieving better precision than single-stage alternatives