
Failure Modes in Production Video ML Systems

Real-time video systems fail in subtle ways that standard monitoring often misses. The failure modes below each come with a minimal mitigation sketch further down.

Buffer bloat occurs when unbounded queues between pipeline stages accumulate frames during transient spikes. Latency rises silently from 100 milliseconds to multiple seconds while throughput metrics appear healthy: the output falls seconds behind real time, violating Service Level Objectives (SLOs) without triggering alerts. The fix is bounded queues with explicit drop policies, preferring drop-oldest for analytics so the system always operates on the freshest frame.

Clock drift and timestamp misuse cause negative latencies or misordered frames when wall-clock timestamps are compared across machines without synchronization. A frame captured at time T on device A can appear to arrive before T when processed on an unsynchronized device B. Use monotonic clocks per process for stage latencies, synchronize with a time service such as Network Time Protocol (NTP) or Precision Time Protocol (PTP) for cross-machine ordering, and attach both capture and processing timestamps to each frame.

GPU resource exhaustion manifests as out-of-memory (OOM) crashes or sudden latency spikes. Copying frames between CPU and GPU for every stage wastes PCIe bandwidth, adding 5 to 10 milliseconds per transfer and increasing jitter, and large batches or high-resolution frames can exceed GPU memory, causing runtime failures mid-inference. The solution is zero-copy paths that keep frames in device memory, per-GPU admission control based on a memory budget, and pre-allocated buffers sized for the maximum expected batch.

Backpressure amplification collapses entire pipelines when a single slow component blocks upstream stages. If post-processing blocks on a slow database insert taking 200 milliseconds, the inference queue fills, then the decode queue fills, and eventually capture stalls; system throughput drops to match the slowest component. Decouple inference from storage with asynchronous writes, use bulk inserts batching 100 events per transaction, and apply circuit breakers that degrade to local buffering or event-only output when external dependencies are slow.

Network jitter on cellular connections shows up as burst packet loss and RTT spikes from 50 to 500 milliseconds. Jitter buffers hide the variance at the cost of added latency. For hard latency SLOs, implement adaptive bitrate that switches to a lower bitrate, reduced resolution, or lower FPS before buffers grow beyond their targets.
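A minimal sketch of a bounded inter-stage queue with a drop-oldest policy, in Python. The two-frame capacity and the drop counter are illustrative assumptions, not prescribed values.

```python
import queue
import threading

class DropOldestQueue:
    """Bounded queue that evicts the oldest frame when full, so the
    consumer always works on the freshest available frame."""

    def __init__(self, maxsize: int = 2):
        self._q = queue.Queue(maxsize=maxsize)
        self._lock = threading.Lock()
        self.dropped = 0  # expose drop count for monitoring/alerting

    def put(self, frame) -> None:
        with self._lock:
            if self._q.full():
                try:
                    self._q.get_nowait()   # evict the oldest frame
                    self.dropped += 1
                except queue.Empty:
                    pass
            self._q.put_nowait(frame)

    def get(self, timeout: float = 1.0):
        # Raises queue.Empty on timeout; callers treat that as "no fresh frame".
        return self._q.get(timeout=timeout)
```

Exposing the drop count matters: with bounded queues, sustained dropping becomes the visible, alertable symptom that replaces silent latency growth.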
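A sketch of carrying both clock domains on every frame, assuming the wall clock is disciplined by NTP or PTP. The field and method names are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TimestampedFrame:
    data: bytes
    # Wall-clock time (NTP/PTP-disciplined) for cross-machine ordering.
    capture_wall_ts: float = field(default_factory=time.time)
    # Monotonic time for intra-process stage latencies.
    capture_mono_ts: float = field(default_factory=time.monotonic)
    stage_mono_ts: dict = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        self.stage_mono_ts[stage] = time.monotonic()

    def stage_latency_ms(self, stage: str) -> float:
        # Valid only within one process: a monotonic clock never goes
        # backwards, so this can never report a negative latency.
        return (self.stage_mono_ts[stage] - self.capture_mono_ts) * 1000.0
```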
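A hedged sketch of per-GPU admission control with a pre-allocated input buffer, assuming PyTorch on a single CUDA device; the memory budget, batch size, and frame shape are illustrative, not measured values.

```python
import torch

MAX_BATCH = 8
FRAME_SHAPE = (3, 1080, 1920)    # CHW float32 frames (assumption)
MEMORY_BUDGET = 6 * 1024**3      # stay below total VRAM, leaving headroom

# Pre-allocate the worst-case input buffer once, so inference never
# triggers a mid-stream allocation (and thus a mid-stream OOM).
input_buffer = torch.empty((MAX_BATCH, *FRAME_SHAPE), device="cuda")

def admit(batch_size: int) -> bool:
    """Admit a batch only if it fits the configured memory budget."""
    if batch_size > MAX_BATCH:
        return False
    return torch.cuda.memory_allocated() < MEMORY_BUDGET

def run_inference(model, frames_gpu: torch.Tensor) -> torch.Tensor:
    # Zero-copy path: frames are assumed to already live in device memory
    # (e.g. hardware-decoded straight into CUDA buffers), so no per-stage
    # CPU<->GPU PCIe transfer is paid here.
    n = frames_gpu.shape[0]
    input_buffer[:n].copy_(frames_gpu)   # device-to-device, no PCIe hop
    with torch.no_grad():
        return model(input_buffer[:n])
```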
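A sketch of decoupling inference from storage with a writer thread, bulk inserts of 100 events per transaction, and a simple circuit breaker. `db_bulk_insert` and the thresholds are stand-ins, not a real client API.

```python
import queue
import threading
import time

BATCH_SIZE = 100          # events per transaction, as in the text
SLOW_WRITE_S = 0.2        # treat >200 ms writes as a slow dependency

events: queue.Queue = queue.Queue(maxsize=10_000)
local_buffer: list = []   # degraded-mode fallback
breaker_open = False

def db_bulk_insert(batch):
    pass  # placeholder for the real storage client

def writer_loop():
    global breaker_open
    while True:
        batch = [events.get()]                 # block for the first event
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        if breaker_open:
            local_buffer.extend(batch)         # degrade: buffer locally
            continue
        start = time.monotonic()
        db_bulk_insert(batch)
        if time.monotonic() - start > SLOW_WRITE_S:
            breaker_open = True                # trip; a timer would reset it

threading.Thread(target=writer_loop, daemon=True).start()
# The inference thread only ever calls events.put_nowait(detection),
# so a slow database can never block the pipeline upstream.
```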
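A sketch of an adaptive-bitrate ladder driven by RTT samples. The rungs and the 150 ms target mirror the example figures later in this page, but the step-down logic itself is an assumption.

```python
LADDER = [  # (bitrate_kbps, resolution, fps)
    (2000, (1280, 720), 30),
    (1000, (854, 480), 30),
    (500,  (640, 360), 15),
]

class AbrController:
    def __init__(self, rtt_target_ms: float = 150.0):
        self.rtt_target_ms = rtt_target_ms
        self.rung = 0

    def on_rtt_sample(self, rtt_ms: float) -> tuple:
        """Return the (bitrate, resolution, fps) to encode at next."""
        if rtt_ms > self.rtt_target_ms and self.rung < len(LADDER) - 1:
            self.rung += 1          # degrade before buffers bloat
        elif rtt_ms < 0.5 * self.rtt_target_ms and self.rung > 0:
            self.rung -= 1          # recover when the link improves
        return LADDER[self.rung]
```

The key design choice is reacting to RTT before the jitter buffer grows: the controller trades quality for latency proactively instead of letting the buffer absorb the spike.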
💡 Key Takeaways
Buffer bloat accumulates frames in unbounded queues during spikes, silently increasing latency from 100ms to multiple seconds while throughput appears normal
Clock drift across unsynchronized machines creates negative latencies or misordered frames; fix with monotonic clocks per process and NTP or PTP synchronization for cross-machine ordering
CPU-to-GPU frame copies cost 5 to 10ms of PCIe transfer time each; zero-copy paths and shared device memory reduce jitter and prevent GPU memory exhaustion
Backpressure amplification occurs when a 200ms database insert blocks post-processing, filling all upstream queues and collapsing pipeline throughput to that of the slowest component
Decode stalls on corrupted Groups of Pictures (GOPs) or variable keyframe intervals cause frozen frames; configure keyframe intervals of 1 to 2 seconds and add decoder timeouts with stream reset (sketched after this list)
Model instability on dropped frames causes tracker ID churn that loses object identities; use motion-compensated interpolation or re-identification embeddings to stabilize tracks across gaps
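A sketch of the decoder timeout mentioned above, assuming a decoder object with hypothetical decode_next() and reset_stream() methods. If no frame arrives within the timeout (e.g. a corrupted GOP), the stream is reset to resync on the next keyframe instead of freezing indefinitely.

```python
import threading

DECODE_TIMEOUT_S = 2.0   # roughly one keyframe interval (assumption)

def decode_with_watchdog(decoder):
    result = {}

    def work():
        result["frame"] = decoder.decode_next()

    t = threading.Thread(target=work, daemon=True)
    t.start()
    t.join(timeout=DECODE_TIMEOUT_S)
    if t.is_alive() or "frame" not in result:
        # Drop the corrupted GOP and resync on the next keyframe. A real
        # system would also tear down and rebuild the stalled decoder.
        decoder.reset_stream()
        return None
    return result["frame"]
```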
📌 Examples
A production video analytics system added per-stage p95 latency alerts and discovered buffer bloat only once queues exceeded 10 frames, by which point end-to-end latency was 800ms over the 200ms SLO
Cellular video streaming on autonomous delivery robots saw RTT spikes from 50 to 500ms during handoffs between towers; adaptive bitrate stepping down from 2 Mbps to 500 Kbps kept latency under the 150ms target
A multi-camera tracking system lost 30% of object IDs when network jitter caused frame drops; adding re-identification embeddings with a 0.85 cosine-similarity threshold recovered 90% of the lost tracks (matching logic sketched below)
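A sketch of the re-identification matching from the last example: embeddings of new detections are compared against recently lost tracks, and IDs reattach above the 0.85 cosine-similarity threshold. The data layout and helper names are assumptions.

```python
import numpy as np

REID_THRESHOLD = 0.85

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reattach_ids(lost_tracks: dict, detections: list) -> dict:
    """lost_tracks: {track_id: embedding}; detections: [(det_id, embedding)].
    Returns {det_id: track_id} for matches above the threshold."""
    matches = {}
    for det_id, emb in detections:
        best_id, best_sim = None, REID_THRESHOLD
        for track_id, track_emb in lost_tracks.items():
            sim = cosine_similarity(emb, track_emb)
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        if best_id is not None:
            matches[det_id] = best_id
            lost_tracks.pop(best_id)   # each lost track reattaches once
        # unmatched detections get fresh IDs elsewhere in the tracker
    return matches
```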