Real-Time Video Processing Pipeline Architecture
Real-time video processing is a chain of bounded-resource stages: video flows from capture through decode, inference, and post-processing within strict latency budgets. Interactive applications like video calls target under 200 milliseconds one way, safety-critical robot control loops need 30 to 100 milliseconds, and live analytics can tolerate 500 milliseconds to a few seconds.
The pipeline stages have distinct performance characteristics. Ingest and transport are network bound and sensitive to jitter. Decode has a roughly fixed cost per pixel and benefits heavily from hardware acceleration, handling 1080p30 in single-digit milliseconds on dedicated units versus tens of milliseconds on CPU. Preprocessing and inference are compute bound; batching can raise GPU utilization from around 30% to 85%, but it adds queuing delay. Post-processing such as tracking is CPU and memory bound.
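To illustrate that batching trade-off, here is a minimal deadline-capped micro-batching sketch in Python. The queue-based setup and the `max_batch` and `max_wait_s` values are assumptions for the example, not from any particular system:

```python
import queue
import time


def collect_batch(q, max_batch=8, max_wait_s=0.005):
    """Gather up to max_batch frames for one GPU inference call,
    but never hold a frame longer than max_wait_s: larger batches
    raise utilization, the deadline caps the queuing delay."""
    deadline = time.monotonic() + max_wait_s
    batch = [q.get()]  # block until the first frame arrives
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more frames in time
    return batch
```

Tuning `max_wait_s` is the knob: a few milliseconds of waiting buys bigger batches, while a zero wait degenerates to per-frame inference.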
The critical principle is time-aware flow control with bounded queues between stages. Google Meet keeps one-way latency around 50 to 150 milliseconds by decimating frames: it runs inference on one of every two frames and interpolates results for a smooth UI. When queues fill, the system drops the oldest frames so it always operates on the freshest data, preventing buffer bloat, where latency creeps from 100 milliseconds to multiple seconds during transient spikes.
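A minimal sketch of the drop-oldest policy, assuming Python threads and the standard library `queue`; the `FrameQueue` name is illustrative:

```python
import queue


class FrameQueue:
    """Bounded queue that evicts the oldest frame when full,
    so downstream stages always see the freshest data."""

    def __init__(self, maxsize=4):
        self._q = queue.Queue(maxsize=maxsize)

    def put(self, frame):
        while True:
            try:
                self._q.put_nowait(frame)
                return
            except queue.Full:
                try:
                    self._q.get_nowait()  # drop the oldest frame
                except queue.Empty:
                    pass  # a consumer emptied it first; retry the put
```

Because the queue is bounded and eviction happens at the producer, a slow stage manifests as dropped frames rather than silently growing latency.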
Protocol choice fundamentally constrains achievable latency. WebRTC, with congestion control and jitter buffers, delivers sub-500-millisecond one-way latency. HTTP-based streaming like HLS or DASH prioritizes scale and CDN compatibility but adds 2 to 6 seconds in low-latency modes and 6 to 30 seconds in standard modes.
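To make the budget arithmetic concrete, here is a toy one-way accounting. The per-stage figures are assumptions loosely drawn from the numbers above, not measurements:

```python
# Target one-way budgets in milliseconds, per application class.
BUDGET_MS = {"video_call": 200, "robot_control": 100, "live_analytics": 2000}

# Assumed per-stage costs (ms) for a WebRTC-based pipeline.
STAGES_MS = {
    "transport_webrtc": 60,  # network plus jitter buffer
    "decode_hw": 5,          # hardware-accelerated 1080p30
    "preprocess": 3,
    "inference": 25,
    "postprocess": 10,
}

total = sum(STAGES_MS.values())
for app, budget in BUDGET_MS.items():
    status = "OK" if total <= budget else "OVER"
    print(f"{app}: {total}ms of {budget}ms budget -> {status}")
```

Under these assumed figures the pipeline fits a video call's 200ms budget but already overshoots a 100ms robot control loop, and no amount of pipeline tuning rescues an HLS transport that spends the entire budget before the first decoded pixel.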
💡 Key Takeaways
•Interactive video targets under 200ms end-to-end, safety-critical loops need 30 to 100ms, live analytics tolerates 500ms to a few seconds
•Hardware decode cuts 1080p30 processing from tens of milliseconds on CPU to single-digit milliseconds on dedicated units
•Bounded queues with a drop-oldest policy prevent buffer bloat, where latency silently creeps from 100ms to multiple seconds
•WebRTC delivers sub-500ms one-way latency, while HLS or DASH adds 2 to 6 seconds even in low-latency modes
•Frame decimation runs inference on one of every two frames and interpolates results in between, keeping the UI smooth within the latency budget
📌 Examples
Google Meet uses WebRTC with scalable video coding and bandwidth estimation to maintain 50 to 150ms one-way latency on good networks
Production systems attach monotonic timestamps at each stage (capture time, decode time, inference start and end, publish time) to track p95 and p99 latency violations, as sketched below
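A minimal sketch of that per-stage timestamping, assuming Python; `FrameTrace`, the stage names, and the nearest-rank percentile helper are all illustrative:

```python
import math
import time
from dataclasses import dataclass, field


@dataclass
class FrameTrace:
    """Monotonic timestamps recorded as one frame passes each stage."""
    stamps: dict = field(default_factory=dict)

    def mark(self, stage):
        self.stamps[stage] = time.monotonic()

    def latency_ms(self, start="capture", end="publish"):
        return (self.stamps[end] - self.stamps[start]) * 1000.0


def p_tail(latencies_ms, p):
    """Nearest-rank percentile (e.g. p=95 or p=99) over a latency window."""
    vals = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100.0 * len(vals)))
    return vals[rank - 1]


# Usage: call trace.mark("capture"), ..., trace.mark("publish") per frame,
# then alert when p_tail(window, 99) exceeds the stage's latency budget.
```

Using `time.monotonic()` rather than wall-clock time matters here: it is immune to NTP adjustments, so stage-to-stage deltas stay trustworthy.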