
Cost and Capacity Management for Continuous Training at Scale

Continuous training at scale demands careful resource management. Uber runs thousands of models with retraining cadences from hours to days, creating coordinated load spikes that can starve training clusters. The solution is admission control and staggering: prioritize models by business criticality (fraud and safety first, experimental features last), enforce per-model compute budgets (GPU-hours per month), and spread scheduled retrains across time windows. Netflix staggers nightly retrain jobs across 2-to-4-hour windows to smooth cluster utilization, targeting 70 to 80 percent average utilization to leave headroom for priority spikes.

Cost optimization balances freshness against expense. Full retrains on large datasets are expensive: a typical recommender model trained on 200 million interactions for 6 hours costs thousands of dollars in compute. Incremental updates cost roughly 10x less but risk quality degradation. The sweet spot is a hybrid with caching: run daily full retrains on preemptible or spot instances (50 to 70 percent cost reduction), cache materialized features in object storage to avoid recomputation, and reserve incremental updates for fast-moving signals. Meta reported a 40 percent training cost reduction by aggressively caching embeddings and feature aggregates.

Capacity planning requires forecasting retrain load. Model count grows linearly with product features and market segmentation (Airbnb runs separate pricing models per market and property type, scaling to thousands). Set auto-trigger frequency caps (maximum 1 retrain per day per model) to prevent drift-induced retrain storms. During incidents, freeze non-critical retrains and reserve capacity for firefighting. Monitor queue depth and job latency: if p95 job-start delay exceeds 30 minutes, scale the cluster or throttle submissions. The key metric is cost per incremental accuracy point: if a 1 percent AUC gain costs $10,000 per month in extra compute, validate that the business lift justifies the expense.
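To make the staggering concrete, here is a minimal Python sketch that deterministically spreads scheduled retrains across a fixed window by hashing each model ID; the 3-hour window, 01:00 kickoff time, and model names are illustrative assumptions, not any vendor's configuration.

```python
"""Minimal sketch: deterministic staggering of nightly retrains across a window."""
import hashlib
from datetime import datetime, timedelta

STAGGER_WINDOW = timedelta(hours=3)          # spread nightly retrains over 2-4 hours
WINDOW_START = datetime(2024, 1, 1, 1, 0)    # hypothetical 01:00 nightly kickoff


def staggered_start(model_id: str) -> datetime:
    """Hash the model ID to a fixed offset so each model always starts at the
    same point in the window, keeping aggregate cluster load smooth."""
    bucket = int(hashlib.sha256(model_id.encode()).hexdigest(), 16)
    offset = timedelta(seconds=bucket % int(STAGGER_WINDOW.total_seconds()))
    return WINDOW_START + offset


for model in ["fraud_scorer", "eta_predictor", "recs_ranker_v3"]:
    print(model, staggered_start(model).time())
```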
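Admission control can likewise be sketched as a simple gate that checks the per-model budget, the auto-trigger frequency cap, and cluster headroom before a retrain is queued. The ModelPolicy fields, tier numbering, and 80 percent utilization threshold below are assumptions chosen to mirror the numbers in the text, not a production policy.

```python
"""Sketch of an admission controller for retrain requests."""
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ModelPolicy:
    tier: int                    # 0 = fraud/safety, higher = less critical
    gpu_hours_budget: float      # monthly compute budget
    gpu_hours_used: float        # spend so far this month
    last_retrain: datetime
    min_interval: timedelta = timedelta(days=1)   # frequency cap: 1 retrain/day


def admit(policy: ModelPolicy, est_gpu_hours: float,
          now: datetime, cluster_util: float) -> bool:
    """Admit a retrain only if budget, frequency cap, and headroom allow it."""
    if policy.gpu_hours_used + est_gpu_hours > policy.gpu_hours_budget:
        return False   # monthly compute budget exhausted
    if now - policy.last_retrain < policy.min_interval:
        return False   # cap auto-triggered retrains to prevent drift storms
    if cluster_util > 0.80 and policy.tier > 0:
        return False   # during spikes, only critical tiers get capacity
    return True
```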
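The cost-per-accuracy-point check is simple arithmetic; the sketch below just encodes the $10,000-per-AUC-point example from the text, with the measured business lift per AUC point left as an input you would supply.

```python
"""Back-of-the-envelope check: does the business lift cover the extra compute?"""


def cost_per_auc_point(extra_monthly_cost: float, auc_gain_pct: float) -> float:
    return extra_monthly_cost / auc_gain_pct


def worth_increasing_cadence(extra_monthly_cost: float, auc_gain_pct: float,
                             monthly_lift_per_auc_point: float) -> bool:
    # Increase retrain cadence only if measured revenue/engagement lift
    # covers the additional compute bill.
    return monthly_lift_per_auc_point * auc_gain_pct >= extra_monthly_cost


print(cost_per_auc_point(10_000, 1.0))   # $10,000 per AUC point per month
```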
💡 Key Takeaways
Stagger scheduled retrains across 2-to-4-hour windows to smooth cluster utilization, targeting 70 to 80 percent average with 20 to 30 percent headroom for priority spikes (Netflix approach across thousands of nightly jobs)
Preemptible or spot instances reduce training cost by 50 to 70 percent for fault-tolerant batch jobs, with automatic retry on preemption and checkpointing every 30 minutes to avoid losing progress (a preemption-tolerant loop is sketched after this list)
Feature caching cuts costs by 40 percent: Meta caches materialized embeddings and aggregates in object storage, reusing them across multiple model training runs instead of recomputing from raw events
Enforce per-model compute budgets (GPU-hours per month) and auto-trigger frequency caps (maximum 1 retrain per day per model) to prevent drift storms and runaway costs when monitoring thresholds are misconfigured
Cost per incremental accuracy point is the key business metric: if a 1 percent AUC gain costs $10,000 per month in extra compute, validate that business lift (revenue, engagement) justifies the expense before increasing cadence
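The checkpoint-every-30-minutes pattern can be sketched as a preemption-tolerant training loop that resumes from the newest checkpoint on restart. The checkpoint path layout and the train_step / save_ckpt callables below are placeholders, not a specific framework's API.

```python
"""Sketch of a preemption-tolerant training loop with periodic checkpointing."""
import glob
import os
import time

CKPT_DIR = "checkpoints"          # assumption: synced to durable object storage
CKPT_INTERVAL_S = 30 * 60         # checkpoint every 30 minutes


def latest_step() -> int:
    """Find the newest checkpoint (files named step_<N>.ckpt) to resume from."""
    ckpts = glob.glob(os.path.join(CKPT_DIR, "step_*.ckpt"))
    steps = (int(os.path.basename(p).split("_")[1].split(".")[0]) for p in ckpts)
    return max(steps, default=0)


def train(total_steps: int, train_step, save_ckpt) -> None:
    step = latest_step()          # resume here after a spot preemption
    last_save = time.monotonic()
    while step < total_steps:
        train_step(step)          # caller-supplied single training step
        step += 1
        if time.monotonic() - last_save >= CKPT_INTERVAL_S:
            save_ckpt(step)       # bounds lost work to one interval on preemption
            last_save = time.monotonic()
    save_ckpt(step)               # final checkpoint on normal completion
```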
📌 Examples
Uber prioritizes fraud and safety model retrains with reserved capacity and immediate scheduling, while experimental recommendation models run on spot instances and queue during peak load, saving 60 percent on training costs
Airbnb runs separate Smart Pricing models per market and property type (thousands of models total), capping each to daily retrains and using shared feature pipelines to amortize aggregation cost across models