
Model Versioning, Rollout, and Governance

Production machine learning systems require rigorous versioning and rollout processes to maintain reliability and enable safe iteration. Every model checkpoint bundles weights, preprocessing code, taxonomy version, training metrics, and metadata into a versioned artifact. This ensures reproducibility and prevents training-serving skew. When a new model version deploys, the system must handle version mixing during the transition, cache invalidation, and rollback capability.

Canary deployments send 1 to 5 percent of live traffic to the new model while keeping the majority on the existing version. Teams monitor key metrics including prediction latency, error rates, and accuracy proxies such as confidence distributions. If the canary shows regression, for example p99 latency increasing from 80 ms to 150 ms or average confidence dropping by 0.05, the rollout stops and engineering investigates. Shadow inference runs the new model on 10 percent of production traffic without serving results to users, comparing predictions and latency against the current model to detect distribution shifts before they affect users.

For high-risk domains like content moderation or financial fraud, human-in-the-loop review is mandatory for low-confidence predictions. Predictions with confidence below 0.8 route to human auditors who provide ground-truth labels. These labels feed back into the training data, creating a continuous improvement loop. Audit trails track every prediction with model version, input hash, output label, confidence, and timestamp for compliance and debugging. Pinterest maintains 90 days of audit logs; at 10 billion daily predictions and 5 KB per record, that is roughly 50 TB of log data per day.

Rollback plans are critical. Versioned cache keys enable instant traffic shifts by changing the lookup key from modelv47 to modelv46. Dual-write periods maintain predictions from both old and new models during the transition, enabling A/B comparison and instant fallback. Cost planning uses QPS, cache hit rate, and per-request compute to size fleets. At 20,000 QPS with a 90% cache hit rate and 250 QPS per GPU, the minimum fleet is 8 GPUs, but teams provision 12 for headroom, costing approximately $15K per month at cloud GPU pricing.
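The canary gate described above reduces to a comparison between baseline and candidate metrics over the observation window. Below is a minimal sketch, not any team's actual tooling: it assumes aggregated per-version metrics are already available, echoes the p99 latency and 0.05-confidence-drop figures from the text, and uses an illustrative error-rate threshold of its own.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    """Aggregated metrics for one model version over the observation window."""
    p99_latency_ms: float
    error_rate: float
    mean_confidence: float

def canary_passes(baseline: CanaryMetrics, candidate: CanaryMetrics,
                  max_latency_ratio: float = 1.5,
                  max_error_rate_delta: float = 0.005,
                  max_confidence_drop: float = 0.05) -> bool:
    """Return True if the candidate may continue rolling out."""
    if candidate.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False  # latency regression, e.g. p99 going from 80 ms to 150 ms
    if candidate.error_rate - baseline.error_rate > max_error_rate_delta:
        return False  # error-rate regression (illustrative threshold)
    if baseline.mean_confidence - candidate.mean_confidence > max_confidence_drop:
        return False  # accuracy-proxy regression: average confidence dropped too far
    return True

# The regression scenario from the text halts the rollout.
baseline = CanaryMetrics(p99_latency_ms=80, error_rate=0.001, mean_confidence=0.91)
candidate = CanaryMetrics(p99_latency_ms=150, error_rate=0.001, mean_confidence=0.90)
assert not canary_passes(baseline, candidate)
```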
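The human-in-the-loop routing and the audit-trail record can be sketched in a few lines as well. This is a hypothetical sketch: plain Python lists stand in for the real review queue and log sink, while the record fields follow the ones named in the text (model version, input hash, output label, confidence, timestamp).

```python
import hashlib
import json
import time

REVIEW_THRESHOLD = 0.8  # predictions below this confidence go to human auditors

def handle_prediction(image_bytes: bytes, label: str, confidence: float,
                      model_version: str, audit_log: list, review_queue: list) -> None:
    """Record every prediction for the audit trail; queue low-confidence ones for review."""
    record = {
        "model_version": model_version,
        "input_hash": hashlib.sha256(image_bytes).hexdigest(),
        "label": label,
        "confidence": confidence,
        "timestamp": time.time(),
    }
    audit_log.append(json.dumps(record))   # compliance and debugging trail
    if confidence < REVIEW_THRESHOLD:
        review_queue.append(record)        # human auditors supply ground-truth labels

audit_log, review_queue = [], []
handle_prediction(b"...", "golden_retriever", 0.62, "modelv47", audit_log, review_queue)
assert len(review_queue) == 1  # 0.62 < 0.8, so this prediction is routed for review
```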
💡 Key Takeaways
Model checkpoints bundle weights, preprocessing code, taxonomy version, and training metrics into versioned artifacts to ensure reproducibility and prevent training-serving skew
Canary deployments send 1 to 5 percent of traffic to new models, rolling back if p99 latency regresses (for example from 80 ms to 150 ms) or average confidence drops by 0.05 or more
Shadow inference on 10 percent of traffic compares new and current models without user impact, detecting distribution shifts before full deployment
Human-in-the-loop review for predictions below 0.8 confidence provides ground-truth labels that feed back into training for continuous improvement
Audit trails tracking model version, input hash, output label, confidence, and timestamp for every prediction enable compliance and debugging, storing roughly 50 TB per day at 10B daily predictions and 5 KB per record, retained for 90 days
Versioned cache keys enable instant rollback by changing the lookup key from modelv47 to modelv46, and dual-write periods maintain both versions during the transition (see the sketch below)
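A minimal sketch of the versioned-cache-key rollback and dual-write idea from the takeaway above, assuming an in-memory dict stands in for the production cache and that the old model's predictions are still being produced during the dual-write window:

```python
# A stand-in prediction cache; production would use something like Memcached or Redis.
cache: dict[str, str] = {}

ACTIVE_VERSION = "modelv47"    # version currently being served
PREVIOUS_VERSION = "modelv46"  # kept warm for rollback
DUAL_WRITE = True              # True during the transition window

def cache_key(version: str, input_hash: str) -> str:
    # The model version is part of the key, so old entries never need invalidating.
    return f"{version}:{input_hash}"

def store_predictions(input_hash: str, new_label: str, old_label: str | None = None) -> None:
    """Write the serving model's prediction; during dual-write, also keep
    the previous model's prediction under its own versioned key."""
    cache[cache_key(ACTIVE_VERSION, input_hash)] = new_label
    if DUAL_WRITE and old_label is not None:
        cache[cache_key(PREVIOUS_VERSION, input_hash)] = old_label

def lookup_prediction(input_hash: str) -> str | None:
    return cache.get(cache_key(ACTIVE_VERSION, input_hash))

store_predictions("abc123", new_label="dog", old_label="canine")
assert lookup_prediction("abc123") == "dog"

# Rollback is a single version flip: lookups immediately hit the old model's
# still-warm entries, with no cache invalidation required.
ACTIVE_VERSION = "modelv46"
assert lookup_prediction("abc123") == "canine"
```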
📌 Examples
Meta content moderation rollout: Shadow 10% traffic for 48 hours, canary 2% user traffic for 24 hours monitoring false negative rate, full rollout over 7 days with dual cache writes, rollback ready via cache key flip
Amazon product classification: Human review queue for confidence < 0.75 handles 8% of predictions, auditors label 200K images per week, improves rare category F1 by 12 points in next training cycle
Google Photos capacity planning: 20,000 QPS, 90% cache hit, 2,000 QPS reaching the GPU tier, 250 QPS per A100, provision 12 GPUs against an 8-GPU minimum for headroom, roughly $15K per month at $1,250 per GPU per month cloud pricing (see the sizing sketch below)
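The fleet-sizing arithmetic from the capacity-planning example can be written out directly. A rough sketch; the 1.5x headroom factor and the $1,250 per GPU per month price are assumptions chosen to reproduce the quoted 8-minimum, 12-provisioned, $15K-per-month figures.

```python
import math

def size_gpu_fleet(total_qps: float, cache_hit_rate: float, qps_per_gpu: float,
                   headroom: float = 1.5, cost_per_gpu_month: float = 1250.0):
    """Back-of-the-envelope GPU fleet sizing from QPS, cache hit rate, and per-GPU throughput."""
    gpu_qps = total_qps * (1.0 - cache_hit_rate)     # traffic that misses the cache
    min_gpus = math.ceil(gpu_qps / qps_per_gpu)      # bare-minimum fleet
    provisioned = math.ceil(min_gpus * headroom)     # add headroom for spikes and failures
    monthly_cost = provisioned * cost_per_gpu_month
    return min_gpus, provisioned, monthly_cost

# 20,000 QPS, 90% cache hit, 250 QPS per A100:
print(size_gpu_fleet(20_000, 0.90, 250))  # -> (8, 12, 15000.0)
```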