Video Optimization and Multi-Camera Deployment Strategies
Production video systems optimize detection with temporal strategies and careful resource allocation. Running a detector at full frame rate is wasteful because adjacent frames are highly correlated. A common pattern detects at 5 to 10 Hz and tracks at full frame rate with Kalman filters plus IoU-based association or re-identification embeddings. Detecting on every third frame of a 30 frames per second stream cuts detector calls by 67 percent with minimal quality loss under moderate motion. A motion trigger forces a fresh detection when track confidence decays below a threshold or scene motion exceeds a limit.
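As a minimal sketch of that detect-then-track loop, assuming a hypothetical `detect(frame)` function that returns `(x1, y1, x2, y2)` boxes, with greedy IoU association standing in for a full Kalman-plus-assignment tracker:

```python
DETECT_EVERY = 3  # run the detector at 10 Hz on a 30 frames per second stream

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track_stream(frames, detect, iou_thresh=0.3):
    """Detect every DETECT_EVERY frames; associate greedily by IoU in between."""
    tracks = []  # last-known boxes; a real tracker also keeps IDs and Kalman state
    for i, frame in enumerate(frames):
        if i % DETECT_EVERY == 0:
            detections = list(detect(frame))
            matched = []
            for t in tracks:
                best = max(detections, key=lambda d: iou(t, d), default=None)
                if best is not None and iou(t, best) >= iou_thresh:
                    matched.append(best)   # track continues with a refreshed box
                    detections.remove(best)
            tracks = matched + detections  # unmatched detections start new tracks
        # On in-between frames a Kalman motion model would propagate each box;
        # here boxes simply persist, which is adequate for moderate motion.
        yield list(tracks)
```

A production version would add track IDs, a motion model for the in-between frames, and the motion trigger described above.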
Multi-camera retail safety illustrates the trade-offs. An edge box ingests 16 streams at 1080p and 30 frames per second. A 33-millisecond per-frame budget leaves 15 to 20 milliseconds for detection after capture, resize, and non-maximum suppression. A YOLO-class model in FP16 on an edge GPU delivers 25 to 40 frames per second per stream at 640-pixel input with throughput-oriented scheduling. Detecting at 10 Hz and tracking at 30 Hz meets the latency budget while handling all 16 concurrent streams.
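The resource math behind that claim can be made explicit; the numbers below are the scenario's stated figures, not measurements:

```python
STREAMS, FPS, DETECT_HZ = 16, 30, 10

frame_budget_ms = 1000 / FPS                   # 33.3 ms of wall clock per frame
detector_calls_per_s = STREAMS * DETECT_HZ     # 160 detector calls/s across the box
gpu_ms_per_call = 1000 / detector_calls_per_s  # 6.25 ms of GPU time per call if serialized

print(f"{frame_budget_ms:.1f} ms frame budget, "
      f"{detector_calls_per_s} detector calls/s, "
      f"{gpu_ms_per_call:.2f} ms GPU budget per call")
# Detecting at the full 30 Hz would triple this to 480 calls/s,
# leaving roughly 2 ms per call, which is why the 10 Hz detect rate matters.
```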
For automotive perception, systems process 6 to 8 cameras under strict real-time guarantees. A perception cycle of 20 to 33 milliseconds feeds tracking and planning. Specialized accelerators and tight memory layouts avoid tensor copies, which can add several milliseconds per frame. Tesla's on-board systems run dense single-stage detection heads at tens of frames per second per camera. The detector must output calibrated confidences and temporally stable boxes for downstream tracking.
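The copy-avoidance idea can be shown in miniature with NumPy; the `DET_INPUT` buffer and `normalize_into` helper are illustrative, not taken from any production stack:

```python
import numpy as np

# Preallocate the detector's input tensor once at startup.
DET_INPUT = np.empty((3, 640, 640), dtype=np.float32)

def normalize_into(resized_hwc):
    """Convert an HWC uint8 frame to CHW float32, in place into DET_INPUT.

    transpose() returns a view (no copy), and np.divide with out= writes
    into the preallocated buffer instead of allocating a fresh ~4.9 MB
    array (3 * 640 * 640 * 4 bytes) on every frame.
    """
    chw_view = resized_hwc.transpose(2, 0, 1)
    np.divide(chw_view, 255.0, out=DET_INPUT)
    return DET_INPUT
```

In production the same principle appears as pinned host buffers and accelerator-resident tensors, so each frame costs one transfer rather than a chain of host-side copies.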
Serving patterns differ by workload. Offline batch jobs use static batching of 8 to 32 images, prefetch with pinned memory, and keep preprocessing on the GPU; a single data-center GPU sustains 200 to 400 images per second. Online systems use micro-batching of 2 to 4 with small queue timeouts, or pure batch size 1 for strict p99 targets. Isolating pre- and post-processing on separate CPU threads eliminates head-of-line blocking, and capping proposal candidates before NMS bounds the worst-case suppression cost so the 99th percentile does not spike.
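Here is a sketch of the online micro-batching loop, assuming a hypothetical `infer(images)` call; the short fill timeout is what keeps an early request from waiting indefinitely for a batch to fill:

```python
import queue

MAX_BATCH = 4           # micro-batch cap for online serving
FILL_TIMEOUT_S = 0.002  # wait at most 2 ms for additional requests

requests = queue.Queue()  # each item: {"image": ..., "reply": queue.Queue()}

def serving_loop(infer):
    """Collect up to MAX_BATCH requests or hit the timeout, then run once."""
    while True:
        batch = [requests.get()]  # block until the first request arrives
        while len(batch) < MAX_BATCH:
            try:
                batch.append(requests.get(timeout=FILL_TIMEOUT_S))
            except queue.Empty:
                break  # ship a partial batch rather than hold up p99
        results = infer([r["image"] for r in batch])
        for req, res in zip(batch, results):
            req["reply"].put(res)  # hand each result back to its caller
```

Pre- and post-processing would run on separate threads that feed and drain this loop, so a slow decode never blocks the GPU.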
💡 Key Takeaways
• Detecting at 10 Hz and tracking at 30 Hz cuts detector compute by 67 percent with minimal quality loss for moderate-motion scenes
• Multi-camera edge systems processing 16 streams at 1080p and 30 frames per second use YOLO in FP16 at 640-pixel input to deliver 25 to 40 frames per second per stream
• Automotive perception with 6 to 8 cameras requires 20 to 33 millisecond cycle times, using specialized accelerators and avoiding tensor copies that add several milliseconds
• Offline batch jobs achieve 200 to 400 images per second per GPU with static batching of 8 to 32 images, prefetching, and GPU-side preprocessing
• Online serving uses micro-batching of 2 to 4 or batch size 1, isolates pre- and post-processing on dedicated CPU threads, and caps NMS candidates to bound p99 latency
📌 Examples
Retail warehouse safety system detecting at 10 Hz on 16 cameras, tracking with a Kalman filter at 30 Hz, cutting detector GPU load by 67 percent while maintaining temporal continuity
Tesla automotive stack processing 8 cameras at 30 frames per second within a 25-millisecond perception cycle on custom accelerators, using single-stage detectors with tight memory layouts
Google Photos batch indexing 100 million images per day on a GPU cluster at 10,000 images per second throughput, using static batching of 16 to 32 images per GPU