Definition
CI/CD for ML extends traditional software delivery to treat data, features, and models as first-class artifacts alongside code. Unlike stateless web services, ML systems must manage training datasets (often terabytes), feature pipelines, model binaries (sometimes gigabytes), and metadata that captures full lineage.
WHY ML IS DIFFERENT
Models are trained on historical data and validated on offline benchmarks that may not reflect live traffic. A recommendation model might achieve 0.85 precision@10 offline, yet at runtime encounter shifted user behavior, features missing because of upstream latency, or numerical differences between the training and serving environments.
💡 Key Risk: Training-serving skew, data drift, and model decay are ML-specific failure modes that do not exist in traditional software deployments.
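One way to catch this class of failure early is to compare feature statistics between the training snapshot and live serving logs. The sketch below flags a feature whose serving mean has drifted far outside the training spread; the feature values and the cutoff are illustrative, not from any specific monitoring system.

```python
from statistics import mean, stdev

def detect_skew(train_values, serve_values, threshold=0.25):
    """Flag a feature whose serving distribution has shifted relative
    to training, using a z-score of the means. `threshold` is an
    illustrative cutoff; production systems tune this per feature."""
    mu_t, sd_t = mean(train_values), stdev(train_values)
    mu_s = mean(serve_values)
    if sd_t == 0:
        return mu_s != mu_t
    shift = abs(mu_s - mu_t) / sd_t
    return shift > threshold

# Training saw session lengths centered near 5; serving drifted upward.
train = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
serve = [6.4, 6.1, 6.5, 6.3, 6.2, 6.6]
print(detect_skew(train, serve))  # True: serving mean sits far outside the training spread
```

Real deployments use richer statistics (population stability index, KL divergence, per-slice checks), but the principle is the same: the monitor runs continuously on serving traffic, not just at validation time.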
THREE DISTINCT LOOPS
Continuous Integration: Validates code, feature transformations, and data contracts in under 10 minutes using small fixtures and unit tests.
Continuous Training: Runs on schedule or drift trigger, trains on production-scale data (e.g., 8TB covering 14 days), produces candidate model with lineage (commit hash, data snapshot IDs, metrics).
Continuous Deployment: Moves promoted model through staging, shadow (sees live traffic but does not affect users), and progressive canary rollouts with guardrails watching latency, error rates, and business metrics.
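The lineage record that continuous training attaches to each candidate (commit hash, data snapshot IDs, metrics) can be sketched as a small data structure. Field names and the fingerprint scheme here are illustrative assumptions, not a specific registry's schema.

```python
from dataclasses import dataclass, field
import hashlib
import json
import time

@dataclass
class ModelLineage:
    # Illustrative fields; adapt to your model registry's schema.
    commit_hash: str
    data_snapshot_ids: list
    metrics: dict
    trained_at: float = field(default_factory=time.time)

    def fingerprint(self):
        """Stable identity derived from code + data inputs, so the
        same commit and snapshots always map to the same candidate."""
        payload = json.dumps(
            {"commit": self.commit_hash,
             "snapshots": sorted(self.data_snapshot_ids)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

candidate = ModelLineage(
    commit_hash="9f2c1ab",
    data_snapshot_ids=["events-2024-05-01", "events-2024-05-14"],
    metrics={"precision_at_10": 0.85},
)
print(candidate.fingerprint())
```

Because the fingerprint depends only on the commit and the sorted snapshot IDs, a retrain from identical inputs is recognizable as the same candidate, which is what makes lineage-based rollback and audit possible.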
PRODUCTION SCALE
Large platforms manage thousands of models with hourly/daily retraining, model registries tracking full lineage, and online feature stores delivering features in under 10ms. Shadow evaluations run before promotion. p95 inference latency stays at 20-40ms. Pipelines must prove statistical performance against baselines and slice-based fairness metrics before any model touches user traffic.
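The "prove performance against baselines and slice-based fairness metrics" gate can be sketched as a simple check: the candidate must match or beat the baseline overall, and no slice may regress beyond a tolerance. Metric names and thresholds below are illustrative assumptions.

```python
def passes_promotion_gate(candidate, baseline, min_lift=0.0, max_slice_gap=0.02):
    """Promotion gate sketch: candidate must meet the baseline overall
    and must not regress any slice by more than `max_slice_gap`.
    A missing slice in the candidate counts as a failure."""
    if candidate["overall"] < baseline["overall"] + min_lift:
        return False
    for slice_name, base_score in baseline["slices"].items():
        if candidate["slices"].get(slice_name, 0.0) < base_score - max_slice_gap:
            return False
    return True

baseline  = {"overall": 0.83, "slices": {"new_users": 0.78, "power_users": 0.88}}
candidate = {"overall": 0.85, "slices": {"new_users": 0.79, "power_users": 0.87}}
print(passes_promotion_gate(candidate, baseline))  # True: overall lift, no slice regressed beyond tolerance
```

Note the gate is asymmetric by design: an overall improvement never excuses a slice regression, which is how fairness constraints stay binding even when aggregate metrics improve.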
✓ ML CI/CD manages code plus data, features, and model artifacts as versioned, immutable dependencies with full lineage tracking
✓ Training-serving skew is the silent killer: different transformations, missing-value handling, or numerical libraries between training and serving cause metric drops that offline validation misses
✓ Three-loop architecture separates concerns: CI completes in under 10 minutes for fast feedback, continuous training operates on production-scale data, and CD uses shadow and canary deployments with automated guardrails
✓ Shadow deployment surfaces skew and latency issues without user impact by logging predictions on live traffic for a fixed budget (e.g., 10 million requests over 2 hours) before promoting to canary
✓ Progressive rollouts start at 1% traffic for 30 minutes, then 5% for 2 hours, with automated rollback triggering in under 2 minutes if p95 latency exceeds 50ms or business KPIs drop more than 2%
✓ Production-scale examples: Uber Michelangelo manages thousands of models with sub-10ms feature fetches, Netflix maintains 20-40ms p95 inference for personalization, and Google TFX formalizes data validation and skew detection
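The rollback guard described in the takeaways (p95 latency over 50ms, or a business KPI drop of more than 2%) can be sketched as a single check evaluated during each canary stage. The latency samples and KPI values are illustrative.

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile over a sorted copy of the samples."""
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]

def should_rollback(latencies_ms, kpi_canary, kpi_baseline,
                    latency_slo_ms=50.0, max_kpi_drop=0.02):
    """Rollback guard sketch using the thresholds from the notes:
    trigger on p95 latency above the SLO or a relative KPI drop
    exceeding `max_kpi_drop` versus the baseline cohort."""
    if p95(latencies_ms) > latency_slo_ms:
        return True
    if kpi_baseline > 0 and (kpi_baseline - kpi_canary) / kpi_baseline > max_kpi_drop:
        return True
    return False

healthy = [12, 18, 22, 25, 30, 31, 33, 35, 38, 41]
print(should_rollback(healthy, kpi_canary=0.100, kpi_baseline=0.101))  # False: within both guardrails
print(should_rollback(healthy, kpi_canary=0.090, kpi_baseline=0.100))  # True: 10% KPI drop
```

In practice this check runs on a short evaluation window (e.g., the trailing few minutes of canary traffic) so that rollback can fire within the under-2-minute budget the notes describe.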