Failure Modes and Edge Cases in Production Object Detection
Production object detection systems fail in predictable ways that require defensive engineering. Crowded scenes with overlapping instances of the same class stress non-maximum suppression (NMS). When 20 people stand close together, candidate boxes have high Intersection over Union (IoU) overlap, and greedy NMS with a 0.5 IoU threshold suppresses true positives, reducing recall. Soft-NMS or class-agnostic NMS helps, but NMS latency can spike to 30 percent of end-to-end time when thousands of candidates survive confidence filtering, so careful threshold tuning is essential.
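To make the trade-off concrete, here is a minimal sketch (Python with NumPy; function names and thresholds are illustrative, not tuned values) contrasting greedy NMS, which hard-drops overlapping neighbors, with Gaussian Soft-NMS, which decays their scores so overlapping true positives in crowds can survive:

```python
# Greedy NMS vs. Soft-NMS sketch, assuming boxes as [x1, y1, x2, y2]
# numpy arrays for a single class. Thresholds are illustrative.
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic NMS: hard-suppress any box whose IoU with a kept box exceeds the threshold."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thresh]  # high-overlap neighbors are dropped entirely
    return keep

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS (Gaussian variant): decay neighbor scores instead of removing them."""
    scores = scores.copy()
    keep = []
    idxs = np.arange(len(scores))
    while idxs.size > 0:
        top = scores[idxs].argmax()
        i = idxs[top]
        keep.append(i)
        idxs = np.delete(idxs, top)
        overlaps = iou(boxes[i], boxes[idxs])
        scores[idxs] *= np.exp(-(overlaps ** 2) / sigma)  # Gaussian score decay
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```

Note that Soft-NMS trades the hard IoU cutoff for a score threshold, which is one reason its latency and recall behavior must be re-tuned per deployment.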
Small and thin objects expose architectural limits. A traffic light or distant pedestrian occupying 8×8 pixels at 640-pixel input falls below the effective receptive field of single-stage detectors with stride-32 feature maps, and two-stage detectors fail when Region Proposal Network anchors mismatch object scales. Solutions include higher input resolution, which quadruples compute, or Feature Pyramid Networks (FPN) with stride-4 to stride-8 features, which add 20 to 40 percent latency.
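A quick back-of-envelope sketch of why this happens, using the level-assignment heuristic from the FPN paper (k = k0 + log2(√(wh)/224)); the helper names and the P2 to P5 clamp range are illustrative assumptions:

```python
# Sketch: sub-cell objects on coarse feature maps, plus the FPN
# level-assignment heuristic (Lin et al., 2017). Numbers are illustrative.
import math

def cells_covered(obj_px: float, stride: int) -> float:
    """Feature-map cells an object spans per side at a given stride.
    An 8x8-pixel object on a stride-32 map covers only 0.25 cells,
    so it never dominates any cell's receptive field."""
    return obj_px / stride

def fpn_level(w: float, h: float, k0: int = 4, canonical: float = 224.0) -> int:
    """Assign a box to pyramid level P_k via k = k0 + log2(sqrt(w*h)/224),
    clamped here to P2..P5 (strides 4..32). Small boxes land on fine levels."""
    k = k0 + math.log2(math.sqrt(w * h) / canonical)
    return int(min(5, max(2, math.floor(k))))

print(cells_covered(8, 32))  # 0.25 -> sub-cell object on a stride-32 map
print(fpn_level(8, 8))       # 2    -> routed to P2 (stride 4) in an FPN
```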
Motion blur and low light degrade features. Detectors trained on crisp daytime images lose 10 to 20 percent recall at night or in rain; automotive systems see false positives on reflections and glare and miss pedestrians in shadows. Temporal aggregation across frames and exposure-aware augmentation during training mitigate this. Domain shift is equally pervasive: a model trained on one camera type or geography becomes miscalibrated when aspect ratios, viewpoints, or backgrounds change, so online calibration monitoring with drift detection and periodic fine-tuning are required.
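One way such monitoring can be implemented, sketched under the assumption that drift shows up in the detector's confidence distribution: compare a rolling window of live scores against a validation reference with a two-sample Kolmogorov-Smirnov test. The class name, window size, and significance level are hypothetical:

```python
# Minimal drift-monitoring sketch on detection confidences, assuming a
# reference score sample collected on validation data. Window size and
# the KS significance level are illustrative, not tuned values.
from collections import deque
import numpy as np
from scipy.stats import ks_2samp

class ConfidenceDriftMonitor:
    def __init__(self, reference_scores, window=5000, alpha=0.01):
        self.reference = np.asarray(reference_scores)
        self.window = deque(maxlen=window)   # rolling sample of live scores
        self.alpha = alpha                   # significance level for the KS test

    def update(self, frame_scores) -> bool:
        """Append this frame's detection confidences; return True if the
        live distribution has drifted from the reference."""
        self.window.extend(frame_scores)
        if len(self.window) < self.window.maxlen:
            return False                     # wait until the window fills
        stat, p_value = ks_2samp(self.reference, np.fromiter(self.window, float))
        return p_value < self.alpha          # drift -> trigger recalibration or fine-tuning
```

A drift signal here would gate downstream actions such as confidence recalibration or scheduling a fine-tuning run on recent data.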
Class imbalance floods single-stage detectors with easy negatives. Without focal loss or hard example mining, models become overconfident on background and underconfident on rare classes. Adversarial patterns also emerge: high-contrast stickers or LED displays can trigger or suppress detections, and content platforms report printed patches that hide objects from simple detectors; defense in depth with multiple viewpoints or modalities is necessary. Finally, resource pressure on edge devices causes thermal throttling and frame drops, and CPU-to-GPU tensor copies add several milliseconds, violating latency budgets.
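For reference, a minimal PyTorch sketch of the binary focal loss from Lin et al. (2017), as used in the classification head of single-stage detectors such as RetinaNet; this follows the standard formulation with the paper's default alpha and gamma:

```python
# Binary focal loss sketch: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
# Down-weights easy negatives so rare foreground classes keep gradient signal.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

With gamma = 2, a well-classified negative at p_t = 0.9 is down-weighted by (1 - 0.9)² = 0.01, which is what keeps the flood of easy background anchors from dominating the loss.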
💡 Key Takeaways
• Crowded scenes with overlapping instances cause NMS to suppress true positives, and NMS can spike to 30 percent of end-to-end latency when thousands of candidates survive filtering
• Small objects at 8×8 pixels fall below the receptive field of stride-32 features, requiring higher resolution that quadruples compute or Feature Pyramid Networks that add 20 to 40 percent latency
• Motion blur and low light reduce recall by 10 to 20 percent and cause false positives on reflections, requiring temporal aggregation and exposure-aware training augmentation
• Domain shift from camera type or geography changes miscalibrates confidence, requiring online drift detection and periodic fine-tuning to maintain accuracy
• Class imbalance without focal loss or hard example mining makes models overconfident on background and underconfident on rare classes, losing precision on the long tail
📌 Examples
Autonomous vehicle detector missing pedestrians in shadows at night, requiring infrared camera fusion and temporal aggregation across 5 to 10 frames
Retail safety system with false positives on LED display reflections, fixed by training with adversarial augmentation and multi-viewpoint verification
Content moderation model trained on US data failing in Asian markets with different camera angles and lighting, requiring geography-specific fine-tuning