Cost and Capacity Management for Continuous Training at Scale
Resource Management Challenges
Continuous training at scale demands careful resource management. Uber runs thousands of models with retraining cadences ranging from hours to days, creating coordinated load spikes that can starve training clusters. The solution is admission control plus staggering: prioritize models by business criticality (fraud and safety first), enforce per-model compute budgets, and spread scheduled retrains across time windows. Netflix staggers nightly retrain jobs across 2-to-4-hour windows, targeting 70 to 80 percent average utilization.
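The admission-control-plus-staggering idea can be sketched in a few lines. This is a minimal illustration, not any company's actual scheduler: the priority tiers, window length, and function names are assumptions, and the stagger slot comes from hashing the model ID so each model gets a stable, evenly spread start time.

```python
import hashlib
from datetime import timedelta

# Assumed retrain window length (the source describes 2-to-4-hour windows).
WINDOW_HOURS = 3

def stagger_offset(model_id: str, window_hours: int = WINDOW_HOURS) -> timedelta:
    """Deterministic start offset within the retrain window for a model.

    Hashing the model ID spreads scheduled retrains evenly across the
    window and keeps each model's slot stable from night to night.
    """
    digest = hashlib.sha256(model_id.encode()).hexdigest()
    minutes = int(digest, 16) % (window_hours * 60)
    return timedelta(minutes=minutes)

# Hypothetical criticality tiers: lower number = admitted first.
PRIORITY = {"fraud": 0, "safety": 0, "recommendations": 1, "experiments": 2}

def admission_order(jobs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Order (model_id, category) jobs by business criticality, then by slot."""
    return sorted(jobs, key=lambda j: (PRIORITY.get(j[1], 3), stagger_offset(j[0])))
```

Because the offset is a pure function of the model ID, no coordination service is needed to keep the load spread; every scheduler replica computes the same slot.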
Cost Optimization
Cost optimization strategies balance freshness against expense. Full retrains on large datasets are expensive: a typical recommender model trained on 200 million interactions for 6 hours costs thousands of dollars in compute. The sweet spot is a hybrid approach with caching: run daily full retrains on spot instances (a 50 to 70 percent cost reduction), cache materialized features in object storage to avoid recomputation, and reserve incremental updates for fast-moving signals.
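The trade-off is easy to make concrete with a back-of-the-envelope cost model. The dollar rates, GPU-hour counts, and the 60 percent spot discount below are illustrative assumptions (the source only states a 50 to 70 percent range), not measured figures:

```python
# Assumed inputs for a rough monthly-cost comparison of retrain strategies.
ON_DEMAND_RATE = 30.0          # $/GPU-hour, illustrative
SPOT_DISCOUNT = 0.6            # assume 60%, within the 50-70% range
FULL_RETRAIN_GPU_HOURS = 48    # e.g. 8 GPUs x 6 hours per full retrain
FEATURE_RECOMPUTE_GPU_HOURS = 12  # extra work if features aren't cached

def monthly_cost(retrains_per_month: int, use_spot: bool, cache_features: bool) -> float:
    """Rough monthly compute cost for a given retrain strategy."""
    rate = ON_DEMAND_RATE * (1.0 - SPOT_DISCOUNT) if use_spot else ON_DEMAND_RATE
    gpu_hours = FULL_RETRAIN_GPU_HOURS
    if not cache_features:
        gpu_hours += FEATURE_RECOMPUTE_GPU_HOURS  # rebuild features from raw logs
    return retrains_per_month * gpu_hours * rate
```

Under these assumptions, daily full retrains on spot with cached features (`monthly_cost(30, True, True)`) come out well under half the cost of on-demand retrains that recompute features each run, which is the point of the hybrid-with-caching strategy.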
Capacity Planning
Capacity planning requires forecasting retrain load. Model count grows roughly linearly with product features and market segmentation. Set auto-trigger frequency caps (at most one retrain per model per day) to prevent drift-induced retrain storms. During incidents, freeze non-critical retrains and reserve capacity for firefighting. Monitor queue depth and job latency: if p95 job-start delay exceeds 30 minutes, scale clusters out or throttle submissions.
Cost Justification
The key metric is cost per incremental accuracy point: if a 1 percent AUC gain costs $10,000 per month in extra compute, validate that the resulting business lift justifies the expense.
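The metric itself is a simple ratio; a small helper (hypothetical name and signature) makes the arithmetic explicit:

```python
def cost_per_point(extra_monthly_cost: float, auc_gain_points: float) -> float:
    """Dollars of extra monthly compute per point of AUC gained.

    Assumed convention: auc_gain_points is in percentage points,
    e.g. 0.5 means AUC improved by 0.5 points.
    """
    if auc_gain_points <= 0:
        # No measurable gain: any extra spend is unjustified by this metric.
        raise ValueError("AUC gain must be positive")
    return extra_monthly_cost / auc_gain_points
```

For the example in the text, `cost_per_point(10_000.0, 1.0)` is $10,000 per AUC point per month; the decision is then whether the revenue or risk reduction attributable to that point exceeds the figure.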