Retraining Strategies: Batch vs Incremental vs Hybrid
Full Batch Retrain
Three retraining strategies dominate production systems, each with distinct trade-offs. A full batch retrain trains from scratch on a sliding window (the last 7 to 28 days), recomputing all weights. This is the safest and most reproducible approach, used by Netflix for nightly homepage-personalization retrains. The cost is high compute (training clusters must be provisioned for peak load) and slow reaction: the model cannot be deployed faster than training completes, typically hours to days for large models.
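The core of the approach can be sketched in a few lines. This is a minimal illustration, not any particular company's pipeline: `Event`, `sliding_window`, and `full_batch_retrain` are hypothetical names, and the "training" step is a stand-in (a label mean) for a real model fit. The point it shows is structural: the window filter plus a fit that starts from zero state each run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    timestamp: datetime
    features: list[float]
    label: float

def sliding_window(events: list[Event], now: datetime, window_days: int = 28) -> list[Event]:
    """Keep only events inside the retraining window (e.g. the last 7-28 days)."""
    cutoff = now - timedelta(days=window_days)
    return [e for e in events if e.timestamp >= cutoff]

def full_batch_retrain(events: list[Event], now: datetime, window_days: int = 28) -> dict:
    """Retrain from scratch on the window: no state from the previous model
    leaks in, which is what makes the run reproducible."""
    window = sliding_window(events, now, window_days)
    if not window:
        raise ValueError("empty training window")
    # Illustrative "model": the mean label over the window. A real system
    # would run a full training job here.
    weight = sum(e.label for e in window) / len(window)
    return {"weight": weight, "n_samples": len(window), "trained_at": now}
```

Because every run recomputes from zero on a deterministic window, two retrains over the same window produce the same model, which is the reproducibility property the paragraph above refers to.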
Incremental Updates
Incremental (online) updates continue training from the previous model checkpoint, processing only new data. Meta uses this for embedding layers and calibrators, updating every 15 to 60 minutes to capture trending topics in feeds. The benefit is speed and low cost (only the deltas are processed), but risks accumulate: catastrophic forgetting (the model loses older patterns), bias drift (recent data is overweighted), and harder validation (there is no clean baseline to compare against).
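A minimal sketch of the checkpoint-and-delta pattern, with an online mean standing in for a real gradient step (the function name and checkpoint shape are illustrative, not a real API):

```python
def incremental_update(checkpoint: dict, new_labels: list[float]) -> dict:
    """Continue from the previous checkpoint, processing only the delta.
    Old data is never revisited, which is exactly where catastrophic
    forgetting and recency bias creep in."""
    n, weight = checkpoint["n_samples"], checkpoint["weight"]
    for y in new_labels:
        n += 1
        weight += (y - weight) / n  # online mean update; stands in for one SGD step
    return {"weight": weight, "n_samples": n}
```

For this toy model the incremental result matches a batch fit over the same data, but for real non-convex models successive warm starts drift away from what a from-scratch retrain would produce, which is why validation is harder.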
Hybrid Strategies
Hybrid strategies combine both: Uber runs daily full retrains for core ETA models while updating nearline contextual features (traffic conditions, weather) every 5 minutes. This balances stability (the full retrain prevents drift from accumulating) with responsiveness (nearline features capture real-time conditions).
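The scheduling logic behind a hybrid setup can be reduced to a small decision function. This is a sketch under assumed intervals (daily full retrain, 5-minute nearline updates); `next_action` and its arguments are hypothetical names, and a production system would use a workflow scheduler rather than this inline check.

```python
from datetime import datetime, timedelta

def next_action(now: datetime, last_full: datetime, last_inc: datetime,
                full_interval: timedelta = timedelta(days=1),
                inc_interval: timedelta = timedelta(minutes=5)) -> str:
    """Pick the retraining step that is due. The full retrain takes priority,
    so drift accumulated by incremental updates is periodically reset."""
    if now - last_full >= full_interval:
        return "full_retrain"
    if now - last_inc >= inc_interval:
        return "incremental_update"
    return "noop"
```

Giving the full retrain priority over the incremental update is the key design choice: it guarantees the incremental chain is bounded in length, so drift can never accumulate past one full-retrain interval.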
Cost Comparison
The common pattern is a periodic full retrain (daily to weekly) plus frequent small updates (minutes to hours) for fast-moving signals. Cost-wise, hybrid is the most expensive but delivers the best accuracy for high-value use cases like fraud detection or dynamic pricing.