Failure Modes in Production Video ML Systems
Stream Failure
Camera feeds drop unexpectedly: network congestion, hardware failure, power outages. A system processing 1,000 streams will see multiple failures per hour. The pipeline must handle missing frames gracefully without crashing or corrupting state.
Detection: Monitor frame arrival timestamps. Alert when frames stop arriving or timestamps show gaps.
Recovery: Reset tracking state for affected cameras. Resume processing when stream returns. Log gaps for offline analysis.
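The detection and recovery steps above can be sketched as a small monitor. This is a minimal illustration, not a production implementation; the class name, the 2-second gap threshold, and the "reset_tracking" signal are assumptions for the sketch.

```python
import time

class StreamMonitor:
    """Tracks last frame arrival per camera, flags gaps, logs them."""

    def __init__(self, gap_threshold_s=2.0):  # threshold is an assumption
        self.gap_threshold_s = gap_threshold_s
        self.last_seen = {}   # camera_id -> last arrival timestamp
        self.gaps = []        # (camera_id, gap_start, gap_end) for offline analysis

    def on_frame(self, camera_id, now=None):
        now = time.monotonic() if now is None else now
        prev = self.last_seen.get(camera_id)
        self.last_seen[camera_id] = now
        if prev is not None and now - prev > self.gap_threshold_s:
            # Stream returned after a gap: log it and tell the caller to
            # reset tracking state for this camera before resuming.
            self.gaps.append((camera_id, prev, now))
            return "reset_tracking"
        return "ok"

    def stalled(self, now=None):
        """Cameras whose frames have stopped arriving (alert on these)."""
        now = time.monotonic() if now is None else now
        return [cid for cid, t in self.last_seen.items()
                if now - t > self.gap_threshold_s]
```

A scheduler would call `stalled()` periodically to drive alerts, while the frame-ingest path calls `on_frame()` and resets tracker state whenever it returns "reset_tracking".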
Latency Spikes
GPU inference time varies with scene complexity. A frame with 50 people takes longer to process than a frame of an empty room. Occasional frames exceed the latency budget, causing downstream delays.
Mitigation: Implement frame dropping when queue depth exceeds threshold. Process most recent frames, skip stale ones. Better to analyze current reality than catch up on history.
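A drop-oldest queue captures this mitigation. This is a sketch under assumptions: the class name and the depth threshold of 4 are illustrative, and a real pipeline would bound the queue with proper thread-safe primitives.

```python
from collections import deque

class FrameQueue:
    """Bounded frame queue that discards stale frames instead of blocking."""

    def __init__(self, max_depth=4):  # threshold is an assumption
        self.frames = deque()
        self.max_depth = max_depth
        self.dropped = 0  # stale frames skipped, for monitoring

    def put(self, frame):
        # Once depth exceeds the threshold, drop the oldest frames:
        # analyzing current reality beats catching up on history.
        while len(self.frames) >= self.max_depth:
            self.frames.popleft()
            self.dropped += 1
        self.frames.append(frame)

    def get(self):
        return self.frames.popleft() if self.frames else None
```

The `dropped` counter is worth exporting as a metric: a steadily climbing drop rate is an early signal that inference capacity no longer matches the incoming frame rate.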
Model Accuracy Drift
Video analytics accuracy degrades silently. Lighting changes, camera angles shift, new object types appear. A model trained for daytime struggles at night. Seasonal changes introduce new failure modes.
Detection: Monitor prediction confidence distributions. Sample frames for human review. Compare current accuracy against held-out validation sets.
Resource Exhaustion
GPU memory: Memory leaks accumulate over hours. Restart workers periodically or implement explicit memory management.
CPU saturation: Decode stage falls behind, queues grow. Monitor decode latency separately from inference latency.
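Keeping decode and inference latencies under separate keys makes it obvious which stage is falling behind. A minimal sketch; the class name, window size, and p95 statistic are assumptions.

```python
from collections import defaultdict, deque

class StageTimer:
    """Per-stage latency samples so decode and inference are visible separately."""

    def __init__(self, window=500):  # sample window is an assumption
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, stage, seconds):
        self.samples[stage].append(seconds)

    def p95(self, stage):
        """Rough 95th-percentile latency for one stage."""
        data = sorted(self.samples[stage])
        if not data:
            return 0.0
        return data[min(len(data) - 1, int(0.95 * len(data)))]
```

Alerting on `p95("decode")` and `p95("inference")` independently distinguishes a CPU-bound decode backlog from a GPU-bound inference slowdown, which have different fixes.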