Failure Modes and Production Reliability
Failure Modes
Production classification systems fail in predictable ways. Understanding these failure modes helps you design monitoring and fallback mechanisms before problems reach users.
Distribution Shift
Models assume production data looks like training data. When user behavior changes, seasons shift, or new content types emerge, predictions degrade silently. A model trained on professional photos fails on smartphone selfies. A summer-trained model struggles with winter scenes.
Detection: Monitor prediction confidence distributions. Sudden drops in average confidence signal distribution shift. Track per-class accuracy on labeled samples weekly.
Mitigation: Retrain on recent data quarterly. Use online learning for fast adaptation. Maintain human review queues for low-confidence predictions.
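The confidence-monitoring idea above can be sketched as a simple drift check: compare the mean prediction confidence in a recent window against a baseline window and alert when it drops past a threshold. The function name and the 0.10 threshold are illustrative assumptions, not from any particular monitoring library.

```python
import random
import statistics

def confidence_drift(baseline_conf, recent_conf, drop_threshold=0.10):
    """Flag possible distribution shift when the average prediction
    confidence drops by more than `drop_threshold` versus a baseline
    window. Threshold and window sizes are illustrative."""
    baseline_mean = statistics.mean(baseline_conf)
    recent_mean = statistics.mean(recent_conf)
    return (baseline_mean - recent_mean) > drop_threshold

# Simulated confidence scores: training-like traffic vs. shifted traffic.
random.seed(0)
baseline = [min(1.0, random.gauss(0.92, 0.03)) for _ in range(1000)]
shifted  = [min(1.0, random.gauss(0.74, 0.08)) for _ in range(1000)]

print(confidence_drift(baseline, baseline))  # no alert on stable traffic
print(confidence_drift(baseline, shifted))   # alert on shifted traffic
```

In production the same comparison is usually done with a proper two-sample test (e.g. Kolmogorov-Smirnov) over the full confidence distribution, since shifts can change the shape without moving the mean much.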
Adversarial Attacks
Small perturbations invisible to humans can flip model predictions. A stop sign with a few pixels changed can be misclassified as a speed limit sign. This matters for security-sensitive applications like content moderation or autonomous systems.
Detection: Monitor for images with unusual pixel patterns. Track prediction volatility under minor transformations.
Mitigation: Adversarial training includes perturbed examples in the training set. Ensemble multiple models; attacks that fool one model rarely fool all of them.
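The volatility signal mentioned under Detection can be sketched as follows: re-run the classifier on a handful of minor transformations of the input and measure how often the label flips. Everything here is a toy stand-in for your real model and augmentation pipeline; inputs far from the decision boundary should be stable, while adversarially nudged ones flip easily.

```python
def prediction_volatility(predict, image, transforms):
    """Fraction of minor transformations whose prediction disagrees
    with the original. High volatility is one signal of an adversarial
    or out-of-distribution input. `predict` and `transforms` are
    stand-ins for a real model and augmentation pipeline."""
    base = predict(image)
    flips = sum(1 for t in transforms if predict(t(image)) != base)
    return flips / len(transforms)

# Toy "model": classifies a flat image by its mean pixel value.
def toy_predict(img):
    return "bright" if sum(img) / len(img) > 0.5 else "dark"

# Minor brightness jitters standing in for crops, noise, etc.
jitters = [lambda im, d=d: [min(1.0, p + d) for p in im]
           for d in (0.01, 0.02, 0.03)]

stable  = [0.9] * 16    # far from the decision boundary
fragile = [0.495] * 16  # sits right on the boundary

print(prediction_volatility(toy_predict, stable, jitters))   # 0.0
print(prediction_volatility(toy_predict, fragile, jitters))  # 1.0
```

A volatility score near 1.0 is a cheap trigger for routing the input to a human review queue rather than trusting the prediction.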
Infrastructure Failures
GPU memory exhaustion: Large batches or memory leaks crash inference servers. Monitor GPU memory utilization and set hard limits.
Latency spikes: Garbage collection, thermal throttling, or noisy neighbors cause intermittent slowdowns. Use P99 latency monitoring, not just averages.
Cold start: Model loading takes 10-30 seconds. Keep warm instances ready. Preload models on deployment.
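The P99-versus-average point is easy to demonstrate numerically. The sketch below uses a nearest-rank percentile over simulated request latencies: a small number of multi-second stalls barely moves the mean but completely dominates the tail that users actually feel.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies."""
    ranked = sorted(latencies_ms)
    idx = max(0, int(0.99 * len(ranked)) - 1)  # nearest-rank, 0-based
    return ranked[idx]

# 985 fast requests plus 15 GC/throttling stalls.
samples = [20] * 985 + [2000] * 15
mean = sum(samples) / len(samples)

print(round(mean, 1))  # 49.7 ms: the average looks healthy
print(p99(samples))    # 2000 ms: the tail tells the real story
```

Monitoring systems typically compute this from histogram buckets rather than raw samples, but the lesson is the same: alert on P99 (or P99.9), not the mean.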