Failure Modes in ML CI/CD Pipelines
Production ML pipelines fail in subtle ways that are invisible to traditional software monitors. Canary contamination occurs when traffic routing is not slice-aware, biasing evaluation. For example, a mobile app canary may receive only iOS users in North America because of load-balancer configuration. Online metrics look fine because iOS users are high-intent, but the model degrades for Android and international users. Without slice-aware routing and per-slice metric evaluation, the pipeline promotes a harmful model. The fix is stratified canary allocation with separate guard thresholds per traffic segment.
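A minimal sketch of stratified allocation with per-slice guards, assuming hash-based bucketing per (platform, region) slice; the slice keys, thresholds, and metric layout below are illustrative assumptions, not any specific platform's API.

```python
import hashlib

# Illustrative slice keys and per-slice guard thresholds (max relative CTR drop).
CANARY_FRACTION = 0.05
GUARD_THRESHOLDS = {
    ("ios", "na"): 0.02,
    ("android", "na"): 0.02,
    ("ios", "intl"): 0.03,
    ("android", "intl"): 0.03,
}

def route(user_id: str, platform: str, region: str) -> str:
    """Hash within each (platform, region) slice so the canary receives
    CANARY_FRACTION of *every* slice, not just whichever slice the load
    balancer happens to favor."""
    key = f"{platform}:{region}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "canary" if bucket < CANARY_FRACTION * 10_000 else "control"

def evaluate_canary(metrics: dict) -> bool:
    """metrics maps (platform, region, arm) -> CTR. Block promotion if any
    single slice regresses beyond its guard, even when the aggregate looks healthy."""
    for (platform, region), max_drop in GUARD_THRESHOLDS.items():
        control = metrics[(platform, region, "control")]
        canary = metrics[(platform, region, "canary")]
        if control > 0 and (control - canary) / control > max_drop:
            return False  # per-slice guard tripped
    return True
```

Promotion then requires every slice to pass its own guard, so a healthy aggregate metric can no longer mask a regression confined to one segment.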
Feature backfill and leakage cause training metrics to lie. A training job backfills a user engagement feature by fetching the most recent value for each user as of the training window end date. If the backfill accidentally includes data from after the prediction timestamp (for example, fetching engagement through January 20 when predicting January 15 events), the model sees future information and achieves an inflated offline AUC of 0.92. In production, it has only past data and drops to 0.80 AUC. Preventing this requires time-bounded feature queries with strict timestamp filters and validation that asserts no feature value has a timestamp later than the label timestamp.
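A sketch of both guards using pandas; the column names (user_id, feature_ts, label_ts) are assumed, and pd.merge_asof performs the point-in-time join so only past feature values can attach to a label.

```python
import pandas as pd

def point_in_time_features(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Attach to each label row the latest feature value observed strictly
    before the label (prediction) timestamp."""
    return pd.merge_asof(
        labels.sort_values("label_ts"),
        features.sort_values("feature_ts"),
        by="user_id",
        left_on="label_ts",
        right_on="feature_ts",
        direction="backward",         # only look into the past
        allow_exact_matches=False,    # a value stamped exactly at label time is still suspect
    )

def assert_no_future_features(train_df: pd.DataFrame) -> None:
    """Validation gate: abort training if any backfilled feature carries a
    timestamp at or after the label timestamp."""
    leaked = train_df["feature_ts"] >= train_df["label_ts"]
    if leaked.any():
        sample = train_df.loc[leaked, ["user_id", "feature_ts", "label_ts"]].head()
        raise ValueError(f"{int(leaked.sum())} rows leak future data, e.g.\n{sample}")
```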
Rollback incompatibilities break production during incidents. A new model version introduces a new feature schema, for example adding an embedding vector field. The serving code is updated to fetch this feature. During a canary rollout, a latency spike triggers automated rollback to the prior model version. However, the prior model does not expect the new feature schema and crashes on the incompatible input, leaving the service down. The solution is schema versioning with backward compatibility: Serving code must handle both old and new schemas during the rollout window, and rollback tests in staging must verify that the old model works with the new serving environment.
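One possible shape for a dual-schema serving path, as a Python sketch; the field names (engagement_7d, user_embedding) and the version numbers are hypothetical.

```python
from typing import Any

def build_model_input(raw_features: dict[str, Any], model_schema_version: int) -> dict[str, Any]:
    """Serve both the old (v1) and new (v2) models from one code path during
    the rollout window: omit fields the target model does not know about, and
    supply a safe default when a new field is missing (e.g. during rollback)."""
    base = {
        "user_id": raw_features["user_id"],
        "engagement_7d": raw_features.get("engagement_7d", 0.0),
    }
    if model_schema_version >= 2:
        # New in v2; fall back to a zero vector if the fetch fails or the
        # feature has not been backfilled yet.
        base["user_embedding"] = raw_features.get("user_embedding", [0.0] * 64)
    return base
```

A staging rollback test would then exercise build_model_input with the old schema version against the new serving build before any production rollout.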
Cold-start and autoscaling failures hit tail latency. A canary scales up from 2 to 20 replicas during a traffic surge. New replicas must load a 2-gigabyte model binary from cloud storage and initialize a 500-megabyte embedding index in memory. This takes 45 seconds, during which the replicas are not ready but receive traffic anyway, causing p99 latency to spike from 50ms to 800ms and triggering rollback. The fix is pre-warming: Mark replicas as not ready until initialization completes, use init containers to load models before accepting traffic, or maintain a warm standby pool. Uber and Netflix pre-warm model replicas and use admission control to shed load during scale-up transients.
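A minimal readiness-gating sketch, assuming a FastAPI serving process whose Kubernetes readiness probe targets the endpoint below; the 45-second sleep stands in for the real model download and index build.

```python
import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
_ready = threading.Event()

def _initialize() -> None:
    # Stand-in for downloading the model binary and building the embedding
    # index; in a real replica this is where the ~45 s of work happens.
    time.sleep(45)
    _ready.set()

@app.on_event("startup")
def startup() -> None:
    # Load in the background; the replica stays NotReady until _ready is set.
    threading.Thread(target=_initialize, daemon=True).start()

@app.get("/healthz/ready")
def ready(response: Response):
    # Point the Kubernetes readinessProbe here so the Service withholds
    # traffic until initialization completes.
    if not _ready.is_set():
        response.status_code = 503
        return {"ready": False}
    return {"ready": True}
```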
💡 Key Takeaways
• Canary contamination from non-stratified routing: An iOS-only, North-America-only canary looks healthy (high click-through rate) but the model degrades for Android and international users; requires slice-aware allocation and per-slice guards
• Feature backfill leakage: Backfill fetches engagement through January 20 when predicting January 15 events, so the model sees future data and achieves 0.92 offline AUC but drops to 0.80 in production; requires strict timestamp filters and validation
• Rollback incompatibility: New model adds a feature schema field, old model crashes on rollback because serving code sends the new schema; requires backward-compatible schema versioning and rollback tests in staging
• Cold-start autoscaling: New replicas take 45 seconds to load a 2GB model and 500MB embeddings and receive traffic before they are ready, so p99 latency spikes from 50ms to 800ms; requires pre-warming and admission control
• Non-determinism from unpinned dependencies and hardware: Same code and data yield different model weights on V100 vs A100 GPUs or NumPy 1.23 vs 1.24, breaking reproducibility and rollback; requires captured environment fingerprints (see the sketch after this list)
• Feedback loop contamination: Recommending popular items makes them more popular, and training on this data amplifies bias and reduces diversity; requires position-debiased training or randomized exploration to break the loop
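As referenced in the non-determinism takeaway, a rough sketch of capturing an environment fingerprint at training time; the fingerprint fields and the convention of storing it next to the model artifact are assumptions.

```python
import hashlib
import json
import platform
import subprocess
import sys

def environment_fingerprint() -> dict:
    """Capture enough of the training environment to detect, at rollback or
    retraining time, that the runtime has silently changed."""
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout
    try:
        gpus = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
    except FileNotFoundError:
        gpus = "none"
    return {
        "python": sys.version,
        "os": platform.platform(),
        "gpus": gpus,
        # Hash of the full dependency list; compare this, not just a handful
        # of top-level packages.
        "packages_sha256": hashlib.sha256(pip_freeze.encode()).hexdigest(),
    }

if __name__ == "__main__":
    # Store this JSON next to the model artifact and compare fingerprints
    # before promoting or rolling back to a version trained elsewhere.
    print(json.dumps(environment_fingerprint(), indent=2))
```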
📌 Examples
Netflix canary contamination incident: Canary routed only to premium subscribers, metrics looked good (higher engagement), full rollout degraded free-tier users, required per-subscription-tier guards and stratified traffic allocation
Uber feature leakage bug: Training pipeline backfilled driver acceptance rate with a lookahead window, offline validation showed 0.91 AUC, production dropped to 0.83, fixed with a strict timestamp assertion that blocked training if any feature timestamp exceeded the label timestamp by more than 1 hour
Meta model rollback failure: New ranking model required a real-time feature from a streaming pipeline, old model did not use it, rollback caused feature fetch errors, service degraded for 8 minutes until manual intervention, fixed with dual read paths supporting both feature schemas
Google Search autoscaling thrash: Traffic surge scaled replicas from 10 to 100, new replicas took 30 seconds to load the model, p99 latency exceeded 200ms, caused a rollback loop, fixed with a pre-warmed standby pool and readiness probes that delay traffic until the model is loaded