

Two Stage Detectors: R-CNN Family Evolution and Performance

Two Stage Detection Approach
Two stage detectors separate localization and classification into distinct steps. First, propose regions that might contain objects. Second, classify each proposal and refine its bounding box. This separation allows each stage to specialize.
Stage 1: Region Proposal
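In Faster R-CNN, the Region Proposal Network (RPN) scans the image features and outputs roughly 1,000-2,000 candidate boxes likely to contain objects. Earlier members of the family (R-CNN and Fast R-CNN) obtained their proposals from an external algorithm, selective search, instead. The RPN does not classify objects yet. It only scores objectness: is there something here worth examining closely?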
How it works: A small network slides over feature maps from the backbone. At each location, it predicts whether each anchor box contains an object and adjusts anchor coordinates. Non-maximum suppression reduces overlapping proposals to a manageable set.
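The two operations described above, adjusting anchor coordinates and suppressing overlapping proposals, can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the helper names (`decode`, `iou`, `nms`), the `(dx, dy, dw, dh)` offset parameterization, and the 0.7 IoU threshold are assumptions chosen for the example, and the objectness scores are supplied by hand rather than predicted by a network.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners


def decode(anchor: Box, deltas: Sequence[float]) -> Box:
    """Shift and scale an anchor by predicted (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Centre moves by a fraction of the anchor size; size scales exponentially.
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.7, top_k: int = 2000) -> List[int]:
    """Greedy NMS: keep the highest-scoring boxes, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep


# Hand-made example: two overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))
```

With a threshold of 0.5 the second box is suppressed because it overlaps the higher-scoring first box too much. At the looser 0.7 threshold all three boxes survive, which shows how the threshold controls the size of the proposal set.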
Stage 2: Classification and Refinement
For each proposal, extract features using RoI pooling (crop and resize the feature map region). Feed these features through classification and regression heads. The classification head predicts object class. The regression head refines bounding box coordinates for tighter fit.
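To make the "crop and resize" step concrete, here is a minimal sketch of RoI max pooling on a single-channel feature map stored as a list of lists. It assumes the proposal is already expressed in integer feature-map coordinates with an exclusive end, and it uses max pooling over quantized bins. The function name `roi_max_pool` and the 2x2 output size are illustrative choices. Real detectors pool every channel and typically use larger outputs such as 7x7.

```python
from typing import List, Sequence, Tuple


def roi_max_pool(fmap: Sequence[Sequence[float]],
                 roi: Tuple[int, int, int, int],
                 out_size: int = 2) -> List[List[float]]:
    """Pool the region (x1, y1, x2, y2) into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    assert h >= 1 and w >= 1, "RoI must cover at least one cell"
    assert 0 <= y1 and y2 <= len(fmap) and 0 <= x1 and x2 <= len(fmap[0])

    pooled: List[List[float]] = []
    for by in range(out_size):
        # Bin edges: floor for the start, ceil for the end, so no bin is empty.
        ys = y1 + (by * h) // out_size
        ye = y1 + ((by + 1) * h + out_size - 1) // out_size
        row: List[float] = []
        for bx in range(out_size):
            xs = x1 + (bx * w) // out_size
            xe = x1 + ((bx + 1) * w + out_size - 1) // out_size
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4x4 feature map with values 0..15, pooled over two different proposals.
fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 4, 4)))  # [[10, 11], [14, 15]]
```

Both proposals produce a 2x2 output even though one covers a 4x4 region and the other a 3x3 region. That fixed output size is what lets the same classification and regression heads process every proposal.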
Per-proposal processing: Each of the 1,000+ proposals is run through the second stage heads, so cost grows with the number of proposals. This per-proposal work is where two stage detectors spend much of their compute, and it is a key reason they are slower than single stage alternatives.
Performance Characteristics
Accuracy: Two stage detectors typically achieve 2-5% higher mAP than single stage models, especially on small objects and crowded scenes.
Speed: 50-200ms per image on modern GPUs. The per-proposal processing creates a speed ceiling that limits real-time applications.
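A quick conversion shows what the latency range above means in frames per second. The 30 FPS target used here is an assumed example of a real-time budget, not a figure from this lesson, and the helper names are illustrative.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second when each image takes latency_ms to process."""
    return 1000.0 / latency_ms


def meets_budget(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True if the per-image latency sustains the (assumed) target frame rate."""
    return fps_from_latency_ms(latency_ms) >= target_fps


for latency in (50.0, 200.0):
    print(latency, "ms ->", fps_from_latency_ms(latency), "FPS,",
          "meets 30 FPS:", meets_budget(latency))
```

At 50 ms the detector sustains 20 FPS and at 200 ms only 5 FPS, so under the assumed 30 FPS target neither end of the range qualifies as real-time.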

💡 Key Takeaways

✓ Two stages: RPN proposes 1,000-2,000 candidate regions, then heads classify and refine each proposal

✓ RPN determines objectness (is something here?) without classifying what it is

✓ Per-proposal processing through second stage heads is the computational bottleneck

✓ Two stage detectors achieve 2-5% higher mAP than single stage but run at 50-200ms per image

📌 Interview Tips

1. Interview Tip: Explain the two stage design as accuracy-optimized: each proposal gets individual attention

2. Interview Tip: Mention RoI pooling as the key technique that enables per-proposal feature extraction
