
City Scale Video Analytics System Design

City scale video analytics must process thousands of camera streams under strict cost and latency constraints. Consider a system with 10,000 cameras streaming 1080p at 15 frames per second (FPS), H.264 encoded at 2 Mbps average, generating roughly 20 Gbps of total ingest bandwidth. Cameras connect to regional gateways over Real Time Streaming Protocol (RTSP) or WebRTC, and the gateways push to a managed ingest service like Amazon Kinesis Video Streams for secure retention and fan out to analytics consumers.

The economics force aggressive downsampling: processing every frame from every stream is prohibitively expensive. A common strategy is temporal downsampling to 5 FPS for detection, paired with lightweight motion gating to skip static scenes. A mid range data center GPU running a small object detector at 640p with INT8 quantization and hardware decode can process on the order of 100 to 300 FPS. At 5 FPS across 10,000 streams the raw detector workload is 50,000 frames per second; motion gating removes a large share of static frames, so roughly 100 to 200 GPUs cover the fleet, depending on model complexity and pre/post-processing overhead.

Buffering discipline is critical at this scale. Each pipeline stage maintains 2 to 4 frames per queue with a strict drop oldest policy when queues fill. End to end alert latency targets stay under 1 second, dominated by detection inference time and inter-stage queuing. Meta and Amazon deploy similar architectures for content moderation and highlights extraction, accepting a few seconds of delay in exchange for horizontal scalability.

Adaptive load shedding keeps the system stable under variable load. When queue depths exceed thresholds or p95 latency degrades, the system reduces work in priority order: lower resolution from 1080p to 720p, reduce FPS from 5 to 3, disable expensive post-processing like multi-object tracking, or drop to event only outputs. This graceful degradation prevents cascading failures where one overloaded stage collapses the entire pipeline.
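A quick back-of-the-envelope check of these numbers, as a minimal sketch: the 40% static-frame skip rate is taken from the traffic monitoring example below, and the per-GPU throughput range used here is an assumption within the 100 to 300 FPS figure above.

```python
# Back-of-the-envelope sizing for the 10,000-camera deployment described above.
# The motion-gate skip rate and per-GPU throughput are illustrative assumptions.

NUM_CAMERAS = 10_000
BITRATE_MBPS = 2                        # H.264 average per stream
DETECT_FPS = 5                          # after temporal downsampling from 15 FPS
MOTION_SKIP = 0.40                      # fraction of frames skipped as static (assumed)
GPU_FPS_LOW, GPU_FPS_HIGH = 150, 300    # 640p INT8 detector throughput per GPU (assumed)

ingest_gbps = NUM_CAMERAS * BITRATE_MBPS / 1000
raw_fps = NUM_CAMERAS * DETECT_FPS                  # 50,000 frames/s before gating
effective_fps = raw_fps * (1 - MOTION_SKIP)         # ~30,000 frames/s after gating

gpus_high = effective_fps / GPU_FPS_LOW             # pessimistic per-GPU throughput
gpus_low = effective_fps / GPU_FPS_HIGH             # optimistic per-GPU throughput

print(f"Ingest bandwidth: {ingest_gbps:.0f} Gbps")
print(f"Detector workload: {effective_fps:,.0f} FPS after motion gating")
print(f"GPU fleet: {gpus_low:.0f} to {gpus_high:.0f} GPUs")
```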
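The downsampling and motion gating step could look like the following sketch, assuming OpenCV-style BGR frames; the frame-difference threshold and the 1% changed-pixel gate are illustrative values, not tuned ones.

```python
import cv2

def gated_frames(capture: cv2.VideoCapture, source_fps: int = 15,
                 target_fps: int = 5, motion_threshold: float = 0.01):
    """Yield only frames worth sending to the detector.

    Temporal downsampling: keep every (source_fps // target_fps)-th frame.
    Motion gating: skip frames whose pixel change vs. the last kept frame
    falls below `motion_threshold` (fraction of changed pixels).
    """
    stride = max(source_fps // target_fps, 1)
    prev_gray = None
    frame_idx = 0

    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frame_idx += 1
        if frame_idx % stride != 0:          # temporal downsampling to ~5 FPS
            continue

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (5, 5), 0)

        if prev_gray is not None:
            diff = cv2.absdiff(gray, prev_gray)
            _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
            changed = cv2.countNonZero(mask) / mask.size
            if changed < motion_threshold:   # static scene: skip detection
                continue

        prev_gray = gray
        yield frame                          # this frame goes to the detector
```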
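A minimal sketch of the per-stage bounded queue with drop oldest semantics, here built on Python's collections.deque; the 2 to 4 frame capacity mirrors the budget above.

```python
import collections
import threading

class FrameQueue:
    """Bounded per-stage queue with a drop-oldest policy.

    A deque with maxlen discards the oldest entry automatically when a new
    frame arrives and the queue is full, so a slow downstream stage never
    accumulates unbounded backlog or stale frames.
    """
    def __init__(self, capacity: int = 4):
        self._frames = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0

    def put(self, frame) -> None:
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1           # oldest frame is about to be evicted
            self._frames.append(frame)      # deque(maxlen=...) drops the oldest

    def get(self):
        with self._lock:
            return self._frames.popleft() if self._frames else None
```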
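The load shedding ladder can be expressed as a small controller that steps through the degradation order above; the class names, thresholds, and recovery logic here are illustrative assumptions, not a specific production design.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    resolution: str = "1080p"
    fps: int = 5
    tracking: bool = True
    event_only: bool = False

# Degradation steps in priority order, mirroring the section above.
DEGRADATION_STEPS = [
    lambda c: setattr(c, "resolution", "720p"),
    lambda c: setattr(c, "fps", 3),
    lambda c: setattr(c, "tracking", False),
    lambda c: setattr(c, "event_only", True),
]

class LoadShedder:
    def __init__(self, p95_budget_ms: float = 1000.0, max_queue_depth: int = 4):
        self.p95_budget_ms = p95_budget_ms
        self.max_queue_depth = max_queue_depth
        self.level = 0                      # number of degradation steps applied
        self.config = PipelineConfig()

    def update(self, p95_latency_ms: float, queue_depth: int) -> PipelineConfig:
        overloaded = (p95_latency_ms > self.p95_budget_ms
                      or queue_depth > self.max_queue_depth)
        if overloaded and self.level < len(DEGRADATION_STEPS):
            DEGRADATION_STEPS[self.level](self.config)   # shed the next cheapest work
            self.level += 1
        elif not overloaded and self.level > 0:
            self.level -= 1                               # recover one step
            self.config = PipelineConfig()
            for step in DEGRADATION_STEPS[:self.level]:
                step(self.config)
        return self.config
```

Called once per monitoring interval with the current p95 latency and queue depth, the controller degrades one step at a time under load and recovers one step at a time when headroom returns, avoiding oscillation between extremes.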
💡 Key Takeaways
Ten thousand 1080p15 streams at 2 Mbps generate 20 Gbps of ingest, requiring a regional gateway architecture with managed services like Kinesis Video Streams
Temporal downsampling to 5 FPS with motion gating reduces processing cost by 66% while maintaining detection coverage for dynamic scenes
A mid range GPU runs a small 640p detector with INT8 quantization at 100 to 300 FPS; with motion gating, roughly 100 to 200 GPUs cover 10K streams sampled at 5 FPS
Bounded queues of 2 to 4 frames per stage with drop oldest policy keep end to end alert latency under 1 second
Adaptive load shedding degrades gracefully: reduce resolution to 720p, lower to 3 FPS, disable tracking, or switch to event only output when p95 latency exceeds thresholds
YouTube Live and Twitch use similar fan out architectures running moderation in sidecar pipelines, accepting 2 to 5 seconds glass to glass latency for horizontal scalability
📌 Examples
Amazon city scale traffic monitoring downsamples from 1080p15 to 640p5, applies motion detection to skip 40% of static frames, reducing compute cost by 70% while maintaining incident detection recall above 95%
Meta content moderation pipeline processes millions of live streams with 2 to 4 frame queues, automatically reducing resolution and FPS during peak load to maintain sub 2 second violation detection latency