
Failure Modes and Production Reliability

Failure Modes

Production classification systems fail in predictable ways. Understanding these failure modes helps you design monitoring and fallback mechanisms before problems reach users.

Distribution Shift

Models assume production data looks like training data. When user behavior changes, seasons shift, or new content types emerge, predictions degrade silently. A model trained on professional photos fails on smartphone selfies. A summer-trained model struggles with winter scenes.

Detection: Monitor prediction confidence distributions. Sudden drops in average confidence signal distribution shift. Track per-class accuracy on labeled samples weekly.
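
A minimal sketch of this check, assuming access to top-1 confidence scores from a baseline window and a recent window; the 0.05 mean-drop threshold and the KS significance level are illustrative values, not tuned recommendations:

```python
# Sketch of distribution-shift detection from prediction confidences.
# Window sizes and thresholds here are illustrative, not tuned values.
import numpy as np
from scipy.stats import ks_2samp

def detect_confidence_shift(baseline: np.ndarray, recent: np.ndarray,
                            mean_drop: float = 0.05,
                            alpha: float = 0.01) -> bool:
    """Compare recent top-1 confidences against a baseline window.

    Flags a shift if the mean confidence drops noticeably, or if a
    two-sample KS test rejects that both windows share a distribution.
    """
    dropped = (baseline.mean() - recent.mean()) > mean_drop
    _, p_value = ks_2samp(baseline, recent)
    return dropped or p_value < alpha
```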

Mitigation: Retrain on recent data quarterly. Use online learning for fast adaptation. Maintain human review queues for low-confidence predictions.
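
One way to implement the review queue, sketched with a plain list standing in for a real queue service; the 0.7 cutoff and the `route_prediction` helper are hypothetical:

```python
# Illustrative routing of low-confidence predictions to a human review
# queue. The threshold and queue structure are assumptions, not a
# specific product API.
REVIEW_THRESHOLD = 0.7  # hypothetical cutoff; tune per class and workload

def route_prediction(image_id: str, label: str, confidence: float,
                     review_queue: list) -> str:
    """Serve confident predictions; defer the rest to human review."""
    if confidence < REVIEW_THRESHOLD:
        review_queue.append({"image_id": image_id, "predicted": label,
                             "confidence": confidence})
        return "pending_review"
    return label
```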

Adversarial Attacks

Small perturbations invisible to humans can flip model predictions. A stop sign with a few pixels changed might be classified as a speed limit sign. This matters for security-sensitive applications like content moderation or autonomous systems.

Detection: Monitor for images with unusual pixel patterns. Track prediction volatility under minor transformations.
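
A possible volatility probe, assuming a PyTorch `model` callable that maps a batch of images to logits; the three label-preserving transforms are illustrative choices:

```python
# Sketch of a volatility probe: re-classify an image under small,
# label-preserving transforms and measure how often the top-1 flips.
import torch
import torchvision.transforms as T

PROBES = [
    T.ColorJitter(brightness=0.05),
    T.RandomAffine(degrees=2, translate=(0.02, 0.02)),
    T.GaussianBlur(kernel_size=3),
]

@torch.no_grad()
def prediction_volatility(model, image: torch.Tensor) -> float:
    """Fraction of probe transforms that change the top-1 class."""
    base = model(image.unsqueeze(0)).argmax(dim=1)
    flips = sum(
        int(model(probe(image).unsqueeze(0)).argmax(dim=1) != base)
        for probe in PROBES
    )
    return flips / len(PROBES)
```

A benign image should survive these tiny transforms; a high flip rate suggests the input sits near a decision boundary, which adversarial examples often do.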

Mitigation: Adversarial training adds perturbed examples to the training set. Serve an ensemble of models - attacks that fool one model rarely fool all of them.
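
A minimal majority-vote ensemble sketch, assuming each member is a PyTorch callable returning logits; low agreement across members can itself be surfaced as a warning signal:

```python
# Majority-vote ensemble sketch. An input crafted against one model
# rarely transfers to all members; disagreement is a warning sign.
import torch
from collections import Counter

@torch.no_grad()
def ensemble_predict(models, image: torch.Tensor):
    """Return (majority class, agreement ratio) across ensemble members."""
    votes = [int(m(image.unsqueeze(0)).argmax(dim=1)) for m in models]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(models)
```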

Infrastructure Failures

GPU memory exhaustion: Large batches or memory leaks crash inference servers. Monitor GPU memory utilization and set hard limits.
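
A sketch of a memory guard using PyTorch's CUDA utilities; the 0.8 cap is an illustrative default, not a recommended value:

```python
# GPU memory guard sketch using PyTorch's CUDA utilities.
import torch

# Hard cap: this process may use at most 80% of device 0's memory,
# so a runaway batch fails fast instead of destabilizing the server.
torch.cuda.set_per_process_memory_fraction(0.8, device=0)

def memory_headroom(device: int = 0) -> float:
    """Fraction of device memory still free for this process."""
    total = torch.cuda.get_device_properties(device).total_memory
    used = torch.cuda.memory_allocated(device)
    return 1.0 - used / total
```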

Latency spikes: Garbage collection, thermal throttling, or noisy neighbors cause intermittent slowdowns. Use P99 latency monitoring, not just averages.
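
A small helper that reports tail latency alongside the mean, using NumPy percentiles; the field names are illustrative:

```python
# Tail-latency report: averages hide spikes, percentiles expose them.
import numpy as np

def latency_report(samples_ms: np.ndarray) -> dict:
    return {
        "mean_ms": float(samples_ms.mean()),
        "p50_ms": float(np.percentile(samples_ms, 50)),
        "p99_ms": float(np.percentile(samples_ms, 99)),
    }
```

With occasional GC pauses or thermal throttling, the mean and P50 can look healthy while P99 is many times higher - exactly the case averages hide.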

Cold start: Model loading takes 10-30 seconds. Keep warm instances ready. Preload models on deployment.
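
A sketch of preloading at process start, assuming a TorchScript checkpoint at a hypothetical model.pt path; the dummy forward pass warms CUDA kernels before the first real request:

```python
# Preload sketch: load the model once at deployment, not per request.
import torch

MODEL = None

def load_model(path: str = "model.pt"):
    """Load at startup, then run a dummy pass to warm CUDA kernels."""
    global MODEL
    MODEL = torch.jit.load(path).eval().cuda()
    with torch.no_grad():
        # Input shape is illustrative; match your model's expected size.
        MODEL(torch.zeros(1, 3, 224, 224, device="cuda"))
    return MODEL

load_model()  # runs at process start, before traffic arrives
```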

💡 Key Takeaways
Distribution shift causes silent accuracy degradation - monitor confidence distributions and per-class accuracy weekly
Adversarial attacks fool models with invisible perturbations - use adversarial training and model ensembles for defense
GPU memory exhaustion and cold starts are common infrastructure failures - monitor utilization and keep warm instances
P99 latency reveals spikes that averages hide - always monitor tail latency for production systems
📌 Interview Tips
1. Explain distribution shift detection with confidence monitoring - dropping average confidence is an early warning signal
2. Mention cold start as a deployment concern - 10-30 second model load times require warm instance pools