Two-Stage Detectors: R-CNN Family Evolution and Performance
Two-Stage Detection Approach
Two-stage detectors separate localization and classification into distinct steps. First, they propose regions that might contain objects. Second, they classify each proposal and refine its bounding box. This separation allows each stage to specialize.
Stage 1: Region Proposal
The Region Proposal Network (RPN) scans the image and outputs 1,000-2,000 candidate boxes likely to contain objects. It does not classify objects yet, only determines objectness: is there something here worth examining closely?
How it works: A small network slides over feature maps from the backbone. At each location it considers a set of anchor boxes (predefined boxes of several scales and aspect ratios) and predicts, for each anchor, an objectness score and offsets that adjust the anchor's coordinates. Non-maximum suppression then removes heavily overlapping proposals, leaving a manageable set.
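The two post-processing steps described above, adjusting anchors and suppressing overlaps, can be sketched in NumPy. This is a minimal illustration, not any particular library's implementation: the function names decode_boxes and nms are hypothetical, boxes are assumed to be [x1, y1, x2, y2] with positive area, the offset parametrization is the center/size form used in the Faster R-CNN paper, and the 0.7 IoU default is an assumed value.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to anchors given as [x1, y1, x2, y2]."""
    anchors = np.asarray(anchors, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa          # anchor centers
    ya = anchors[:, 1] + 0.5 * ha
    cx = xa + deltas[:, 0] * wa            # shift center by a fraction of anchor size
    cy = ya + deltas[:, 1] * ha
    w = wa * np.exp(deltas[:, 2])          # scale width and height
    h = ha * np.exp(deltas[:, 3])
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], axis=1)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes, best score first."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)            # highest objectness first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the kept box too much
        order = rest[iou <= iou_threshold]
    return keep
```

In practice the kept indices would then be truncated to the top-scoring 1,000-2,000 proposals before being passed to the second stage.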
Stage 2: Classification and Refinement
For each proposal, extract features using RoI pooling: crop the region of the feature map that corresponds to the proposal and resize it to a fixed spatial size. Feed these features through classification and regression heads. The classification head predicts the object class. The regression head refines the bounding box coordinates for a tighter fit.
Per-proposal processing: Each of the 1,000+ proposals requires a forward pass through the second-stage heads. This is where two-stage detectors spend most of their compute, making them slower than single-stage alternatives.
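The crop-and-resize step can be sketched as a max-pooling version of RoI pooling. This is a simplified illustration under stated assumptions: the function name roi_max_pool is hypothetical, the feature map is a (C, H, W) NumPy array, the RoI is [x1, y1, x2, y2] in image coordinates with inclusive corners, and spatial_scale is the assumed ratio between feature-map and image resolution. Coordinates are rounded to whole cells, which is the quantization that later variants such as RoIAlign avoid by interpolating instead.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2), spatial_scale=1.0):
    """Pool one RoI from a (C, H, W) feature map into a fixed (C, out_h, out_w) grid."""
    C, H, W = feature_map.shape
    out_h, out_w = output_size
    # Project the RoI onto the feature map and snap to integer cells
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    x1 = min(max(x1, 0), W - 1)
    y1 = min(max(y1, 0), H - 1)
    x2 = min(max(x2, x1), W - 1)
    y2 = min(max(y2, y1), H - 1)
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    out = np.empty((C, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # Bin rows: floor for the start, ceil for the end, so no bin is empty
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ((i + 1) * roi_h + out_h - 1) // out_h
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ((j + 1) * roi_w + out_w - 1) // out_w
            # Max over the bin, independently for each channel
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```

Because every proposal is reduced to the same output size, the classification and regression heads can be ordinary fixed-input layers regardless of how large or small the original proposal was.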
Performance Characteristics
Accuracy: Two-stage detectors typically achieve 2-5% higher mAP than comparable single-stage models, especially on small objects and crowded scenes.
Speed: 50-200 ms per image on modern GPUs, or roughly 5-20 frames per second. The per-proposal processing creates a speed ceiling that limits real-time applications.