Failure Modes in Continuous Training Pipelines
Training-Serving Skew
Production continuous training systems fail in predictable ways. Training-serving skew is the most common: feature definitions diverge between the offline training path and the online serving path. Airbnb discovered a ranking model performing 15 percent worse in production because an online aggregation window was 60 minutes instead of the 24-hour window used in training. The fix is a single declarative feature definition in a feature store, with automated parity validation between the two paths.
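A minimal sketch of the idea, assuming a hypothetical FeatureDef dataclass as the single source of truth (the names CLICK_COUNT, aggregate, and parity_check are illustrative, not any real feature-store API): both the training job and the serving path read the window from one definition, and a parity check fails loudly if they diverge.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class FeatureDef:
    name: str
    window: timedelta  # single source of truth for the aggregation window

CLICK_COUNT = FeatureDef(name="click_count_24h", window=timedelta(hours=24))

def aggregate(event_times, as_of, feature):
    """Count events inside the feature's window, ending at `as_of`."""
    start = as_of - feature.window
    return sum(1 for t in event_times if start <= t < as_of)

def parity_check(event_times, as_of, feature, online_value, tolerance=0):
    """Compare the online-served value against the shared definition."""
    offline_value = aggregate(event_times, as_of, feature)
    assert abs(offline_value - online_value) <= tolerance, (
        f"{feature.name}: offline={offline_value} online={online_value}")
    return offline_value
```

With this structure, the 60-minute-versus-24-hour bug described above cannot arise silently: an online path that hardcodes its own window would fail the parity check on the next validation run.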
Feedback Loops
Feedback loops create self-reinforcing bias. When Netflix recommendation models surface popular content, they make it more popular, skewing future training data and reducing diversity; the result is filter bubbles and long-term engagement drops. Mitigations include exploration mechanisms (epsilon-greedy with 5 to 10 percent random recommendations), propensity-score reweighting to debias training data, and counterfactual evaluation against historical logs.
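The first two mitigations can be sketched together, assuming a simple serving function (the names recommend and ips_weight are illustrative): with probability epsilon a uniformly random candidate is served, and the logging propensity recorded at serve time is later inverted to reweight training examples. The greedy-arm propensity below is the simple approximation 1 - epsilon, ignoring the small chance that the random pick coincides with the top item.

```python
import random

def recommend(ranked_items, candidate_pool, epsilon=0.05, rng=random):
    """Epsilon-greedy serving: with probability epsilon, serve a uniform-random
    candidate so training data keeps covering unpopular items. Returns the
    served item and its logging propensity."""
    if rng.random() < epsilon:
        item = rng.choice(candidate_pool)
        propensity = epsilon / len(candidate_pool)
    else:
        item = ranked_items[0]
        propensity = 1.0 - epsilon  # approximation; see lead-in
    return item, propensity

def ips_weight(propensity):
    """Inverse-propensity weight applied to a logged example at training time,
    so rarely-shown items count proportionally more."""
    return 1.0 / propensity
```

Logging the propensity at serve time is the key design choice: it makes both the reweighting and any later counterfactual evaluation possible without re-running the old policy.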
Label Delay and Leakage
Label delay and leakage are subtle but catastrophic. Meta's ad models predict conversions that occur 1 to 7 days after a click, so training on all labels available at snapshot time leaks future information into the model. The solution is strict event-time semantics: use only labels that would have been available at serving time, implement watermarking to handle late-arriving labels, and validate with time-based splits rather than random splits.
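A minimal sketch of event-time label semantics, under assumed names (label_available, time_based_split) and a 7-day attribution window: a conversion only counts if it had materialized by the snapshot, and a negative label is only trusted once the attribution window has fully closed.

```python
from datetime import datetime, timedelta

def label_available(click_time, conversion_time, snapshot_time,
                    attribution_window=timedelta(days=7)):
    """True only if the label would have been known at snapshot_time.
    Using labels beyond this leaks future information into training."""
    if conversion_time is None:
        # A negative label is trustworthy only once the window has closed.
        return snapshot_time >= click_time + attribution_window
    return (conversion_time <= snapshot_time
            and conversion_time <= click_time + attribution_window)

def time_based_split(examples, cutoff):
    """Split on event time, never randomly: clicks before `cutoff` train,
    clicks at or after `cutoff` evaluate."""
    train = [e for e in examples if e["click_time"] < cutoff]
    test = [e for e in examples if e["click_time"] >= cutoff]
    return train, test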
Data Quality Issues
Data outages and schema drift silently corrupt features. Defend against them with schema contracts that enforce type checking and range validation, canary new features on 1 percent of training data before full rollout, and abort training jobs when validation metrics exceed thresholds.
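A sketch of the schema-contract and abort-threshold ideas, with an assumed dict-based contract (the SCHEMA columns, SchemaViolation, and the 1 percent threshold are illustrative, not from any specific validation library): each row is checked against per-column type and range rules, and the batch validator aborts when the violation rate crosses the threshold.

```python
# Illustrative per-column contract: expected type plus an allowed range.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "ctr": {"type": float, "min": 0.0, "max": 1.0},
}

class SchemaViolation(Exception):
    pass

def validate_row(row, schema=SCHEMA):
    """Raise SchemaViolation on the first missing, mistyped, or out-of-range column."""
    for col, spec in schema.items():
        if col not in row:
            raise SchemaViolation(f"missing column: {col}")
        val = row[col]
        if not isinstance(val, spec["type"]):
            raise SchemaViolation(
                f"{col}: expected {spec['type'].__name__}, got {type(val).__name__}")
        if not spec["min"] <= val <= spec["max"]:
            raise SchemaViolation(
                f"{col}: {val} outside [{spec['min']}, {spec['max']}]")

def validate_batch(rows, max_violation_rate=0.01):
    """Abort the training job if more than 1 percent of rows violate the contract."""
    bad = sum(1 for row in rows if not _row_ok(row))
    rate = bad / max(len(rows), 1)
    if rate > max_violation_rate:
        raise SchemaViolation(f"violation rate {rate:.1%} exceeds threshold")
    return rate

def _row_ok(row):
    try:
        validate_row(row)
        return True
    except SchemaViolation:
        return False
```

Raising inside the pipeline, rather than logging and continuing, is the point: a failed training run is cheaper than a silently corrupted model pushed to production.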