
Video Optimization and Multi-Camera Deployment Strategies

Production video systems optimize detection with temporal strategies and careful resource allocation. Running a detector at full frame rate is wasteful because adjacent frames are highly correlated. A common pattern detects at 5 to 10 Hz and tracks at full frame rate with Kalman filters plus IoU-based association or re-identification embeddings. Detecting on every third frame of a 30 fps stream cuts detector calls by 67 percent with minimal quality loss under moderate motion. A motion trigger forces an extra detection pass when track confidence decays below a threshold or scene motion exceeds a limit (a minimal scheduling sketch follows below).

Multi-camera retail safety illustrates the trade-offs. An edge box ingests 16 streams at 1080p and 30 fps. A 33 ms per-frame budget leaves 15 to 20 ms for detection after capture, resize, and non-maximum suppression. A YOLO-class model in FP16 on an edge GPU delivers 25 to 40 fps per stream at 640-pixel input with throughput-oriented scheduling. Detecting at 10 Hz and tracking at 30 Hz meets the latency budget while handling 16 concurrent streams.

For automotive perception, systems process 6 to 8 cameras with strict real-time guarantees. A perception cycle of 20 to 33 ms feeds tracking and planning. Specialized accelerators and tight memory layouts avoid tensor copies, which can add several milliseconds per frame. Tesla's on-board systems run dense single-stage detection heads at tens of frames per second per camera. The detector must output calibrated confidence and stable boxes across frames for downstream tracking.

Serving patterns differ by workload. Offline batch jobs use static batching of 8 to 32 images, prefetch with pinned memory, and keep preprocessing on the GPU; a single data-center GPU sustains 200 to 400 images per second. Online systems use micro-batching of 2 to 4 with small queue timeouts, or pure batch size 1 for strict p99 targets (a micro-batching sketch appears at the end of this section). Isolating pre- and post-processing on separate CPU threads eliminates head-of-line blocking, and capping proposal candidates before NMS bounds tail latency and prevents 99th-percentile spikes.
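The detect-and-track cadence from the first paragraph can be sketched as a per-stream loop. This is a minimal illustration rather than a production tracker: `run_detector` is a hypothetical stub standing in for a YOLO-class model, `Track.predict` is a placeholder where a Kalman prediction step would go, and the association is a simple greedy IoU match.

```python
DETECT_EVERY_N = 3   # detect at 10 Hz on a 30 fps stream
CONF_FLOOR = 0.4     # re-detect early if track confidence decays below this

def run_detector(frame):
    """Hypothetical stub for a YOLO-class detector: returns [(box, score), ...]."""
    return [((100, 100, 200, 200), 0.9)]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

class Track:
    def __init__(self, box, score):
        self.box, self.score = box, score
    def predict(self):
        # Placeholder for a Kalman prediction step; the box is carried forward
        # and confidence decays so stale tracks trigger re-detection.
        self.score *= 0.95
        return self.box

def associate(tracks, detections, thresh=0.3):
    """Greedy IoU association: refresh matched tracks, spawn tracks for the rest."""
    unmatched = list(detections)
    for t in tracks:
        best = max(unmatched, key=lambda d: iou(t.box, d[0]), default=None)
        if best and iou(t.box, best[0]) >= thresh:
            t.box, t.score = best
            unmatched.remove(best)
    tracks.extend(Track(b, s) for b, s in unmatched)

def process_stream(frames):
    tracks = []
    for i, frame in enumerate(frames):
        stale = tracks and min(t.score for t in tracks) < CONF_FLOOR
        if i % DETECT_EVERY_N == 0 or stale:
            associate(tracks, run_detector(frame))   # full detection pass
        else:
            for t in tracks:
                t.predict()                           # cheap tracking-only frame
        yield [(t.box, t.score) for t in tracks]

# 30 synthetic frames -> roughly 10 detector calls plus any confidence-triggered ones.
for boxes in process_stream(range(30)):
    pass
```

The confidence-decay check implements the motion-trigger idea from the text in its simplest form: whenever tracks go stale, the loop falls back to a full detection pass regardless of the fixed cadence.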
💡 Key Takeaways
Detecting at 10 Hz and tracking at 30 Hz cuts detector compute by 67 percent with minimal quality loss for moderate-motion scenes
Multi-camera edge systems processing 16 streams at 1080p 30 fps use YOLO in FP16 at 640-pixel input to deliver 25 to 40 fps per stream
Automotive perception with 6 to 8 cameras requires 20 to 33 ms cycle times, using specialized accelerators and avoiding tensor copies that add several milliseconds
Offline batch jobs achieve 200 to 400 images per second per GPU with static batching of 8 to 32 images, prefetch, and GPU-side preprocessing
Online serving uses micro-batching of 2 to 4 or batch size 1, isolates pre- and post-processing on CPU threads, and caps NMS candidates to bound p99 latency (see the candidate-cap sketch after this list)
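Capping candidates before NMS can be made concrete with a short sketch. This is an illustrative pure-Python version reusing the `iou` helper from the earlier sketch; the defaults `pre_nms_top_k=1000` and `iou_thresh=0.5` are assumptions, and a real deployment would use the batched NMS of its serving framework.

```python
def cap_then_nms(boxes, scores, pre_nms_top_k=1000, iou_thresh=0.5):
    """Keep only the top-K candidates by score before suppression, so the
    O(K^2) NMS loop has a fixed upper bound no matter how many raw
    candidates a busy frame produces."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:pre_nms_top_k]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```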
📌 Examples
Retail warehouse safety system detecting at 10 Hz on 16 cameras, tracking with Kalman filters at 30 Hz, cutting GPU load by 67 percent while maintaining temporal continuity
Tesla automotive stack processing 8 cameras at 30 fps with a 25 ms perception cycle on custom accelerators, using single-stage detectors with tight memory layouts
Google Photos batch indexing 100 million images per day with a cluster of GPUs at 10,000 images per second throughput, using static batching of 16 to 32 images per GPU
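The micro-batching pattern referenced in the serving-patterns paragraph can be sketched as a small queue-draining loop. This is a minimal sketch under stated assumptions: `run_model` is a hypothetical stub for the detector, results are simply printed rather than returned via per-request futures, and a real server would also pin pre- and post-processing to separate CPU threads as described above.

```python
import queue
import threading
import time

MAX_BATCH = 4             # micro-batch cap for online serving
QUEUE_TIMEOUT_S = 0.002   # wait at most ~2 ms for more requests before running

request_q = queue.Queue()

def run_model(batch):
    """Hypothetical stub: run the detector on a list of preprocessed images."""
    return [f"detections for {item}" for item in batch]

def serving_loop(stop):
    while not stop.is_set():
        try:
            batch = [request_q.get(timeout=0.1)]   # block for the first request
        except queue.Empty:
            continue
        deadline = time.monotonic() + QUEUE_TIMEOUT_S
        # Drain up to MAX_BATCH requests, but never wait past the deadline.
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            try:
                batch.append(request_q.get(timeout=max(deadline - time.monotonic(), 0)))
            except queue.Empty:
                break
        results = run_model(batch)   # one GPU call amortized over 1-4 requests
        for item, result in zip(batch, results):
            print(item, "->", result)   # hand back to the caller in a real server

stop = threading.Event()
worker = threading.Thread(target=serving_loop, args=(stop,), daemon=True)
worker.start()
for i in range(8):
    request_q.put(f"frame_{i}")
time.sleep(0.1)
stop.set()
```

The short queue timeout is the lever that trades a couple of milliseconds of added latency for better GPU utilization; setting `MAX_BATCH = 1` recovers the strict-p99 batch-size-1 configuration mentioned in the text.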