Failure Modes in ML CI/CD Pipelines
Production ML pipelines fail in subtle ways that are invisible to traditional software monitors. Canary contamination occurs when traffic routing is not slice-aware, biasing evaluation. For example, a mobile app canary may receive only iOS users in North America because of load-balancer configuration. Online metrics look fine because iOS users are high-intent, but the model degrades for Android and international users. Without slice-aware routing and per-slice metric evaluation, the pipeline promotes a harmful model. The fix is stratified canary allocation with separate guard thresholds per traffic segment.
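A minimal sketch of stratified allocation with per-slice guards, assuming hash-based bucketing per (platform, region) slice; the slice keys, thresholds, and metric layout below are illustrative assumptions, not any specific platform's API.

```python
import hashlib

# Illustrative slice keys and per-slice guard thresholds (max relative CTR drop).
CANARY_FRACTION = 0.05
GUARD_THRESHOLDS = {
    ("ios", "na"): 0.02,
    ("android", "na"): 0.02,
    ("ios", "intl"): 0.03,
    ("android", "intl"): 0.03,
}

def route(user_id: str, platform: str, region: str) -> str:
    """Hash within each (platform, region) slice so the canary receives
    CANARY_FRACTION of *every* slice, not just whichever slice the load
    balancer happens to favor."""
    key = f"{platform}:{region}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "canary" if bucket < CANARY_FRACTION * 10_000 else "control"

def evaluate_canary(metrics: dict) -> bool:
    """metrics maps (platform, region, arm) -> CTR. Block promotion if any
    single slice regresses beyond its guard, even when the aggregate looks healthy."""
    for (platform, region), max_drop in GUARD_THRESHOLDS.items():
        control = metrics[(platform, region, "control")]
        canary = metrics[(platform, region, "canary")]
        if control > 0 and (control - canary) / control > max_drop:
            return False  # per-slice guard tripped
    return True
```

Promotion then requires every slice to pass its own guard, so a healthy aggregate metric can no longer mask a regression confined to one segment.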
Feature backfill and leakage cause training metrics to lie. A training job backfills a user engagement feature by fetching the most recent value for each user as of the training window end date. If the backfill accidentally includes data from after the prediction timestamp (for example, fetching engagement through January 20 when predicting January 15 events), the model sees future information and achieves an inflated offline AUC of 0.92. In production, it has only past data and drops to 0.80 AUC. Preventing this requires time-bounded feature queries with strict timestamp filters and validation that asserts no feature value has a timestamp later than the label timestamp.
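A sketch of both guards using pandas; the column names (user_id, feature_ts, label_ts) are assumed, and pd.merge_asof performs the point-in-time join so only past feature values can attach to a label.

```python
import pandas as pd

def point_in_time_features(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Attach to each label row the latest feature value observed strictly
    before the label (prediction) timestamp."""
    return pd.merge_asof(
        labels.sort_values("label_ts"),
        features.sort_values("feature_ts"),
        by="user_id",
        left_on="label_ts",
        right_on="feature_ts",
        direction="backward",         # only look into the past
        allow_exact_matches=False,    # a value stamped exactly at label time is still suspect
    )

def assert_no_future_features(train_df: pd.DataFrame) -> None:
    """Validation gate: abort training if any backfilled feature carries a
    timestamp at or after the label timestamp."""
    leaked = train_df["feature_ts"] >= train_df["label_ts"]
    if leaked.any():
        sample = train_df.loc[leaked, ["user_id", "feature_ts", "label_ts"]].head()
        raise ValueError(f"{int(leaked.sum())} rows leak future data, e.g.\n{sample}")
```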
Rollback incompatibilities break production during incidents. A new model version introduces a new feature schema, for example adding an embedding vector field. The serving code is updated to fetch this feature. During a canary rollout, a latency spike triggers automated rollback to the prior model version. However, the prior model does not expect the new feature schema and crashes on the incompatible input, leaving the service down. The solution is schema versioning with backward compatibility: Serving code must handle both old and new schemas during the rollout window, and rollback tests in staging must verify that the old model works with the new serving environment.
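One possible shape for a dual-schema serving path, as a Python sketch; the field names (engagement_7d, user_embedding) and the version numbers are hypothetical.

```python
from typing import Any

def build_model_input(raw_features: dict[str, Any], model_schema_version: int) -> dict[str, Any]:
    """Serve both the old (v1) and new (v2) models from one code path during
    the rollout window: omit fields the target model does not know about, and
    supply a safe default when a new field is missing (e.g. during rollback)."""
    base = {
        "user_id": raw_features["user_id"],
        "engagement_7d": raw_features.get("engagement_7d", 0.0),
    }
    if model_schema_version >= 2:
        # New in v2; fall back to a zero vector if the fetch fails or the
        # feature has not been backfilled yet.
        base["user_embedding"] = raw_features.get("user_embedding", [0.0] * 64)
    return base
```

A staging rollback test would then exercise build_model_input with the old schema version against the new serving build before any production rollout.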
Cold-start and autoscaling failures hit tail latency. A canary scales up from 2 to 20 replicas during a traffic surge. New replicas must load a 2-gigabyte model binary from cloud storage and initialize a 500-megabyte embedding index in memory. This takes 45 seconds, during which the replicas are not ready but receive traffic anyway, causing p99 latency to spike from 50ms to 800ms and triggering rollback. The fix is pre-warming: Mark replicas as not ready until initialization completes, use init containers to load models before accepting traffic, or maintain a warm standby pool. Uber and Netflix pre-warm model replicas and use admission control to shed load during scale-up transients.
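A minimal readiness-gating sketch, assuming a FastAPI serving process whose Kubernetes readiness probe targets the endpoint below; the 45-second sleep stands in for the real model download and index build.

```python
import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
_ready = threading.Event()

def _initialize() -> None:
    # Stand-in for downloading the model binary and building the embedding
    # index; in a real replica this is where the ~45 s of work happens.
    time.sleep(45)
    _ready.set()

@app.on_event("startup")
def startup() -> None:
    # Load in the background; the replica stays NotReady until _ready is set.
    threading.Thread(target=_initialize, daemon=True).start()

@app.get("/healthz/ready")
def ready(response: Response):
    # Point the Kubernetes readinessProbe here so the Service withholds
    # traffic until initialization completes.
    if not _ready.is_set():
        response.status_code = 503
        return {"ready": False}
    return {"ready": True}
```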
💡 Key Takeaways
• Canary contamination from non-stratified routing: An iOS-only, North-America-only canary looks healthy (high click-through rate) but the model degrades for Android and international users; requires slice-aware allocation and per-slice guards
• Feature backfill leakage: Backfill fetches engagement through January 20 when predicting January 15 events, so the model sees future data and achieves 0.92 offline AUC but drops to 0.80 in production; requires strict timestamp filters and validation
• Rollback incompatibility: New model adds a feature schema field, old model crashes on rollback because serving code sends the new schema; requires backward-compatible schema versioning and rollback tests in staging
• Cold-start autoscaling: New replicas take 45 seconds to load a 2GB model and 500MB embeddings and receive traffic before they are ready, so p99 latency spikes from 50ms to 800ms; requires pre-warming and admission control
• Non-determinism from unpinned dependencies and hardware: Same code and data yield different model weights on V100 vs A100 GPUs or NumPy 1.23 vs 1.24, breaking reproducibility and rollback; requires captured environment fingerprints (see the sketch after this list)
• Feedback loop contamination: Recommending popular items makes them more popular, and training on this data amplifies bias and reduces diversity; requires position-debiased training or randomized exploration to break the loop
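As referenced in the non-determinism takeaway, a rough sketch of capturing an environment fingerprint at training time; the fingerprint fields and the convention of storing it next to the model artifact are assumptions.

```python
import hashlib
import json
import platform
import subprocess
import sys

def environment_fingerprint() -> dict:
    """Capture enough of the training environment to detect, at rollback or
    retraining time, that the runtime has silently changed."""
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout
    try:
        gpus = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
    except FileNotFoundError:
        gpus = "none"
    return {
        "python": sys.version,
        "os": platform.platform(),
        "gpus": gpus,
        # Hash of the full dependency list; compare this, not just a handful
        # of top-level packages.
        "packages_sha256": hashlib.sha256(pip_freeze.encode()).hexdigest(),
    }

if __name__ == "__main__":
    # Store this JSON next to the model artifact and compare fingerprints
    # before promoting or rolling back to a version trained elsewhere.
    print(json.dumps(environment_fingerprint(), indent=2))
```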
📌 Examples
Netflix canary contamination incident: Canary routed only to premium subscribers, metrics looked good (higher engagement), full rollout degraded free-tier users, required per-subscription-tier guards and stratified traffic allocation
Uber feature leakage bug: Training pipeline backfilled driver acceptance rate with a lookahead window, offline validation showed 0.91 AUC, production dropped to 0.83, fixed with a strict timestamp assertion that blocked training if any feature timestamp exceeded the label timestamp by more than 1 hour
Meta model rollback failure: New ranking model required a real-time feature from a streaming pipeline, old model did not use it, rollback caused feature fetch errors, service degraded for 8 minutes until manual intervention, fixed with dual read paths supporting both feature schemas
Google Search autoscaling thrash: Traffic surge scaled replicas from 10 to 100, new replicas took 30 seconds to load the model, p99 latency exceeded 200ms, caused a rollback loop, fixed with a pre-warmed standby pool and readiness probes that delay traffic until the model is loaded