Training Infrastructure & Pipelines • Continuous Training & Model Refresh
What is Continuous Training and Model Refresh?
Continuous training (CT) and model refresh transform machine learning from a one-time deployment into a closed-loop control system. The core problem is that production models decay over time: user behavior shifts, new products launch, competitors change tactics, and seasonal patterns evolve. A fraud model trained on pre-holiday traffic will miss new attack vectors during Black Friday. A recommendation model trained three months ago cannot surface content that did not exist then.
The solution is automated retraining and redeployment pipelines that monitor model health, decide when to retrain, validate candidate models both offline and online, and gradually shift traffic only when metrics improve. This spans two freshness dimensions. Data freshness measures how quickly new events become features (streaming aggregates updated every 5 minutes versus daily batch features). Model freshness measures how quickly new patterns make it into model weights (hourly incremental updates versus weekly full retrains).
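To make the closed loop concrete, here is a minimal sketch of one refresh cycle covering the monitor, decide, retrain, validate, and gradual-rollout steps described above. Everything in it is an assumption for illustration: the feature_store, trainer, evaluator, registry, and serving objects stand in for whatever feature store, training job, model registry, and serving layer a given team runs, and the 0.2 PSI trigger and traffic ramp are example values rather than any company's actual settings.

```python
# A minimal sketch of one continuous-training refresh cycle. The feature_store,
# trainer, evaluator, registry, and serving objects are hypothetical stand-ins
# for a real feature store, training job, model registry, and serving layer.

PSI_THRESHOLD = 0.2          # example drift trigger (Population Stability Index)
MIN_OFFLINE_LIFT = 0.002     # candidate must beat production AUC by this margin

def refresh_cycle(feature_store, trainer, evaluator, registry, serving):
    # 1. Monitor: compare recent feature distributions to the training-time snapshot.
    drift = feature_store.drift_against(registry.training_snapshot())

    # 2. Decide: do nothing if drift is small; stability wins over operational churn.
    if drift < PSI_THRESHOLD:
        return "skipped: drift below threshold"

    # 3. Retrain: fit a candidate model on a fresh training window.
    candidate = trainer.fit(feature_store.training_window())

    # 4. Validate offline: the candidate must beat production on a held-out set.
    if evaluator.auc(candidate) < evaluator.auc(registry.production_model()) + MIN_OFFLINE_LIFT:
        return "rejected: no offline improvement"

    # 5. Validate online: ramp traffic gradually, rolling back if live metrics degrade.
    for traffic_pct in (1, 5, 25, 100):
        serving.route(candidate, traffic_pct)
        if not serving.live_metrics_healthy(candidate):
            serving.rollback(candidate)
            return f"rolled back at {traffic_pct}% traffic"

    registry.promote(candidate)
    return "promoted: candidate is the new production model"
```

In a real pipeline each numbered step usually runs as its own orchestrated stage with alerting and audit logs, but the control flow (monitor, decide, retrain, validate offline, ramp online) is the same loop.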
At scale, companies like Netflix retrain homepage personalization models nightly on hundreds of millions of member interactions, keeping inference latency under 30 milliseconds p95. Uber runs thousands of models for ride matching, ETA prediction, and fraud detection with retraining cadences from hours to days depending on drift velocity. Meta processes tens of thousands of training jobs daily across ads and feeds, with some embedding layers updating continuously and full models retraining daily. The key is balancing freshness (reacting to drift quickly) against stability (avoiding metric noise and operational churn).
💡 Key Takeaways
• Data freshness is how quickly new events become features (streaming updates every 5 minutes versus daily batch), while model freshness is how quickly new patterns update weights (hourly incremental versus weekly full retrain)
• Netflix retrains homepage personalization nightly on hundreds of millions of interactions with inference latency under 30 milliseconds p95, balancing freshness with serving cost
• Uber runs thousands of models with retraining cadences from hours (fraud during peak events) to days (pricing models), triggering retrains on drift thresholds such as Population Stability Index (PSI) exceeding 0.2 (see the PSI sketch after this list)
• The core trade-off is freshness versus stability: frequent retraining reacts quickly to drift but risks overfitting to short-term noise and metric flapping, while a slower cadence is more stable but risks stale predictions during regime shifts
• Typical online inference Service Level Objectives (SLOs) are p95 latency of 10 to 30 milliseconds per model stage, with end-to-end chains under 50 to 200 milliseconds depending on product requirements
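The PSI threshold mentioned above can be computed directly from binned feature distributions, as in the self-contained sketch below. The 10-bin quantile scheme, the 1e-6 clipping, and the simulated half-standard-deviation shift are illustrative choices, not a fixed standard.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-time) sample and a recent production sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges come from the reference distribution's quantiles,
    # widened so every production value falls inside some bin.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges[0] = min(edges[0], actual.min())
    edges[-1] = max(edges[-1], actual.max())

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip empty bins to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Simulated drift: the feature's mean shifts between training and serving.
rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 50_000)
serving_sample = rng.normal(0.5, 1.0, 50_000)

print(f"PSI = {population_stability_index(training_sample, serving_sample):.3f}")
```

A commonly cited rule of thumb is that PSI below 0.1 indicates little shift, 0.1 to 0.2 a moderate shift worth watching, and above 0.2 enough drift to investigate or retrain.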
📌 Examples
Meta processes tens of thousands of training jobs daily, with some embedding layers updating continuously (minutes to hours) and full models retraining daily, while maintaining p99 latency under 10 to 20 milliseconds per model stage
Airbnb Smart Pricing retrains weekly to capture seasonality and event-driven demand shifts, with nearline feature aggregation windows of 5 to 60 minutes depending on feature volatility
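As a rough illustration of what a nearline aggregation window like the one in the Airbnb example computes, here is a minimal tumbling-window sketch. The event shape (timestamp, listing_id, price), the 15-minute default, and the count/mean features are assumptions for illustration, not Airbnb's actual pipeline.

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts, window_minutes=15):
    """Floor a timestamp to the start of its tumbling window (window must divide 60)."""
    return ts.replace(minute=(ts.minute // window_minutes) * window_minutes,
                      second=0, microsecond=0)

def aggregate_nearline(events, window_minutes=15):
    """Roll raw (timestamp, listing_id, price) events into per-window counts and means."""
    agg = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for ts, listing_id, price in events:
        key = (listing_id, window_start(ts, window_minutes))
        agg[key]["count"] += 1
        agg[key]["sum"] += price
    return {k: {"count": v["count"], "mean": v["sum"] / v["count"]} for k, v in agg.items()}

# Toy run: three price events for one listing land in two 15-minute windows.
events = [
    (datetime(2024, 11, 29, 10, 3), "listing_42", 180.0),
    (datetime(2024, 11, 29, 10, 11), "listing_42", 195.0),
    (datetime(2024, 11, 29, 10, 21), "listing_42", 210.0),
]
print(aggregate_nearline(events))
```

Shorter windows react faster to demand spikes but cost more to compute and store; the 5-to-60-minute range in the example reflects tuning that trade-off per feature.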