Canary Deployments and Automated Rollback for ML Models
CANARY DEPLOYMENTS
Instead of deploying a new model to 100% of traffic immediately, deploy it to a small percentage of traffic (1-5%) first. Then compare metrics between the canary (new model) and the control (old model).
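One common way to carve out the canary slice is to hash a stable identifier so that each user consistently sees the same model. The sketch below is illustrative only: the function name `assign_variant`, the `salt` parameter, and the choice of a user ID as the routing key are assumptions, not something the text prescribes.

```python
import hashlib

def assign_variant(user_id: str, canary_percent: float, salt: str = "model-canary") -> str:
    """Deterministically assign a user to "canary" or "control".

    canary_percent is a percentage in [0, 100], e.g. 1 for 1% of traffic.
    salt is a hypothetical per-rollout string; changing it reshuffles users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "control"
```

Because the assignment depends only on the salt and the user ID, raising `canary_percent` later keeps existing canary users in the canary group and only adds new ones.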
Process: Deploy the new model to 1% of traffic. Monitor for 1-24 hours, depending on traffic volume and risk tolerance. If metrics are stable or better, gradually increase to 10%, then 50%, then 100%. If metrics degrade, roll back immediately.
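The ramp schedule above can be sketched as a small step function. The stage list mirrors the percentages in the text; the names `RAMP_STAGES` and `next_traffic_percent`, and the convention that 0 means "fully rolled back", are assumptions made for illustration.

```python
# Traffic percentages from the process described above.
RAMP_STAGES = [1, 10, 50, 100]

def next_traffic_percent(current: int, healthy: bool) -> int:
    """Return the canary traffic percentage for the next stage.

    healthy is the outcome of the monitoring window for the current stage.
    Returns 0 (all traffic back on the old model) if metrics degraded.
    """
    if not healthy:
        return 0  # roll back immediately
    index = RAMP_STAGES.index(current)  # raises ValueError for unknown stages
    # Advance one stage, staying at 100% once the rollout is complete.
    return RAMP_STAGES[min(index + 1, len(RAMP_STAGES) - 1)]
```

For example, `next_traffic_percent(1, healthy=True)` moves the canary to 10%, while any unhealthy stage returns 0 regardless of how far the rollout had progressed.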
Metrics to compare: Latency (should not regress), error rate (should not increase), and business metrics such as CTR and conversion (should not decrease significantly).
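For business metrics, "significantly" can be made concrete with a statistical test. The sketch below uses a one-sided two-proportion z-test on conversion counts. This is one reasonable choice rather than a method stated in the text, and the function name and the 5% significance level (critical value 1.645) are assumptions.

```python
import math

def conversion_drop_is_significant(
    control_conversions: int,
    control_total: int,
    canary_conversions: int,
    canary_total: int,
    z_critical: float = 1.645,  # one-sided test at the 5% level (assumed)
) -> bool:
    """Return True if the canary's conversion rate is significantly lower."""
    p_control = control_conversions / control_total
    p_canary = canary_conversions / canary_total
    # Pooled rate under the null hypothesis that both models convert equally.
    pooled = (control_conversions + canary_conversions) / (control_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / canary_total))
    if std_err == 0:
        return False  # no variation at all, so nothing to detect
    z = (p_control - p_canary) / std_err
    return z > z_critical
```

With little canary traffic the test has low power, which is one reason the monitoring window depends on traffic volume.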
Canary deployment catches problems before they affect all users. A model that crashes for 1% of users is bad; a model that crashes for 100% is catastrophic.
AUTOMATED ROLLBACK
Manual rollback is slow. By the time a human notices a problem, investigates, and rolls back, significant damage may have occurred. Automated rollback limits blast radius.
Trigger criteria: Error rate > 5% (immediate rollback). P99 latency > 2x baseline (rollback after 5 minutes). Business metric drop > 10% (rollback after 15 minutes with sufficient statistical confidence).
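These trigger criteria can be written as a single check. The thresholds below are the ones listed above; the `CanaryStatus` fields and the function name `rollback_reason` are hypothetical, and the sketch assumes some upstream system already tracks how long each breach has lasted and whether the business-metric drop is statistically significant.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CanaryStatus:
    error_rate: float = 0.0                 # fraction, 0.05 means 5%
    p99_latency_ms: float = 0.0
    baseline_p99_latency_ms: float = 1.0
    latency_breach_minutes: float = 0.0     # time spent above 2x baseline
    business_metric_drop: float = 0.0       # relative drop, 0.10 means 10%
    business_breach_minutes: float = 0.0    # time spent above a 10% drop
    drop_is_significant: bool = False       # result of a statistical test

def rollback_reason(status: CanaryStatus) -> Optional[str]:
    """Return why the canary should be rolled back, or None if it is healthy."""
    if status.error_rate > 0.05:
        return "error rate above 5%"  # immediate, no waiting period
    latency_breached = status.p99_latency_ms > 2 * status.baseline_p99_latency_ms
    if latency_breached and status.latency_breach_minutes >= 5:
        return "P99 latency above 2x baseline for 5 minutes"
    business_breached = status.business_metric_drop > 0.10
    if (business_breached and status.business_breach_minutes >= 15
            and status.drop_is_significant):
        return "business metric down more than 10% for 15 minutes"
    return None
```

Checking the error rate first reflects its "immediate" status: it is the only trigger with no waiting period.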
Implementation: The deployment system monitors metrics in real time and compares them to the baseline. If the trigger criteria are met, it automatically routes traffic back to the previous model version and alerts the team for investigation.
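A minimal sketch of one pass of that monitoring loop follows. The three callables are hypothetical hooks standing in for whatever metrics store, traffic router, and paging system a real deployment uses; none of them refers to a specific product's API.

```python
from typing import Callable, Optional

def monitor_once(
    check_triggers: Callable[[], Optional[str]],
    route_to_previous_version: Callable[[], None],
    alert_team: Callable[[str], None],
) -> bool:
    """Run one monitoring pass. Return True if a rollback was performed.

    check_triggers compares live metrics to the baseline and returns a
    human-readable reason when a trigger fires, or None when all is well.
    """
    reason = check_triggers()
    if reason is None:
        return False
    # Restore service first, then notify humans for investigation.
    route_to_previous_version()
    alert_team(f"Canary rolled back automatically: {reason}")
    return True
```

In practice a scheduler would call something like this every few seconds for as long as the canary is live. Routing before alerting keeps the blast radius small even if the alerting path is slow.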
VERSION MANAGEMENT
Automated rollback requires having something to roll back to. Maintain at least 3 previous model versions ready to serve. Store model artifacts with metadata (training date, data version, metrics).
Rollback target selection: usually roll back to the immediately previous version. In rare cases (if previous version also had issues), roll back further. Version management enables this flexibility.
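The metadata and target-selection rule can be sketched together. The `ModelVersion` fields follow the metadata listed above; the `healthy` flag, the function name, and the oldest-to-newest ordering of `history` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ModelVersion:
    version: str
    training_date: str            # e.g. an ISO date string
    data_version: str
    metrics: Dict[str, float] = field(default_factory=dict)
    healthy: bool = True          # False if this version had known issues

def select_rollback_target(history: List[ModelVersion]) -> Optional[ModelVersion]:
    """Pick the most recent healthy version.

    history holds the previously served versions, ordered oldest to newest,
    and does not include the version currently being rolled back.
    """
    for candidate in reversed(history):
        if candidate.healthy:
            return candidate  # usually the immediately previous version
    return None  # nothing safe to serve; needs human intervention
```

Keeping at least three entries in `history` means the search still has somewhere to go when the immediately previous version is itself marked unhealthy.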