Definition
Training orchestration manages the complete ML training lifecycle as a directed acyclic graph (DAG) of tasks. Each node represents a distinct operation such as data preparation, feature computation, model training, evaluation, or model registration. The orchestrator schedules tasks in dependency order, handles retries, propagates metadata, and supports resuming from checkpoints.
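A minimal sketch of this scheduling loop, using Python's standard-library `graphlib` for dependency ordering; the task names and retry policy here are hypothetical placeholders, not any orchestrator's real API:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "prepare_data": set(),
    "compute_features": {"prepare_data"},
    "train_model": {"compute_features"},
    "evaluate": {"train_model"},
    "register_model": {"evaluate"},
}

def run_pipeline(dag, run_task, max_retries=2):
    """Execute tasks in dependency order, retrying each up to max_retries times."""
    order = list(TopologicalSorter(dag).static_order())  # topological order
    completed = []
    for task in order:
        for attempt in range(max_retries + 1):
            try:
                run_task(task)
                completed.append(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries: fail the run
    return completed
```

A real orchestrator adds timeouts, parallel execution of independent branches, and persisted state, but the core contract is the same: run each node only after its dependencies succeed.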
Control Plane vs Data Plane
The core design principle is separating concerns into two planes. The control plane handles orchestration logic: scheduling decisions, retry policies, timeout enforcement, and metadata tracking. The data plane executes the actual compute-intensive work: reading gigabytes of training data, computing features across millions of rows, running gradient descent for hours on GPUs. This separation lets you scale compute independently from orchestration and swap out backends without rewriting pipeline logic.
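One way to sketch this separation is a pluggable backend interface: the control plane owns submit-and-retry logic, while anything implementing the interface (Kubernetes, Spark, or a toy local runner) does the work. All names here are illustrative assumptions, not a real library API:

```python
from typing import Protocol

class ComputeBackend(Protocol):
    """Data-plane interface (hypothetical)."""
    def submit(self, task: str, params: dict) -> str: ...  # returns a job id
    def status(self, job_id: str) -> str: ...              # "running" | "succeeded" | "failed"

class LocalBackend:
    """Toy backend that 'runs' jobs instantly; a real one would call Kubernetes or Spark."""
    def __init__(self):
        self.jobs = {}

    def submit(self, task, params):
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = "succeeded"
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]

def run_task(backend: ComputeBackend, task: str, params: dict, retries: int = 2) -> str:
    """Control-plane logic: submit, check outcome, retry on failure."""
    for attempt in range(retries + 1):
        job_id = backend.submit(task, params)
        if backend.status(job_id) == "succeeded":
            return job_id
    raise RuntimeError(f"{task} failed after {retries + 1} attempts")
```

Because the control plane only sees the interface, swapping Spark for Kubernetes means writing a new backend class, not rewriting pipeline logic.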
Production Flow Pattern
In production, the flow typically runs as follows: the orchestrator triggers a training run with specific parameters on a compute backend such as Kubernetes; the training job logs metrics and parameters to an experiment tracker such as MLflow; artifacts are written to object storage under versioned paths; and on successful validation the orchestrator promotes the model in a registry and triggers downstream inference pipelines.
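The promote-on-validation gate at the end of that flow can be sketched as below; the bucket layout, metric name, and threshold are assumptions for illustration, and the registry is modeled as a plain dict:

```python
def run_and_promote(run_id: str, train_fn, registry: dict, min_auc: float = 0.8) -> str:
    """Sketch of the production flow: train, write to a versioned artifact path,
    and promote in the registry only if the validation gate passes."""
    artifact_uri = f"s3://models/{run_id}/model.pkl"  # versioned path in object storage
    metrics = train_fn(artifact_uri)                  # data-plane job; logs to a tracker
    if metrics["auc"] >= min_auc:                     # validation gate
        registry[run_id] = {"uri": artifact_uri, "stage": "staging"}
        return "promoted"
    return "rejected"
```

The key property is that promotion is a control-plane decision made on recorded metrics, so a failed validation leaves the registry untouched.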
Integration Architecture
The production pattern integrates workflow orchestrators (Airflow, Kubeflow Pipelines), compute backends (Kubernetes, Spark), and experiment trackers (MLflow) into a cohesive system where each component handles its specialty while metadata flows seamlessly between stages.
✓ The control plane manages scheduling, retries, and metadata tracking, while the data plane handles compute-intensive operations like feature generation and model training on separate backends
✓ Steps must be idempotent and deterministic, with explicitly versioned inputs and outputs stored in durable storage, to enable safe retries and backfills without data corruption
✓ The orchestrator triggers compute jobs asynchronously and polls for completion rather than running compute directly, isolating control-plane load from data-plane resource spikes
✓ TheFork production pattern: Airflow orchestrates daily DAGs with data freshness checks, a batch backend executes training, and MLflow tracks experiments and manages model promotion through the registry
✓ Pass artifact references, such as object storage uniform resource identifiers (URIs) or database identifiers (IDs), between steps instead of large payloads to avoid memory bottlenecks and enable parallel execution
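The artifact-reference and idempotency points above combine naturally: if each step derives a deterministic output URI from its inputs and code version, a retried or backfilled run writes to the same path instead of duplicating data, and downstream steps receive only the URI. A sketch, with a hypothetical bucket layout:

```python
import hashlib

def output_uri(step: str, input_uris: list[str], code_version: str) -> str:
    """Derive a deterministic, versioned output path from a step's inputs,
    so retries and backfills overwrite the same location idempotently.
    (The bucket name and layout are hypothetical.)"""
    key = hashlib.sha256(
        "|".join([step, code_version, *sorted(input_uris)]).encode()
    ).hexdigest()[:12]
    return f"s3://pipeline/{step}/{key}/part.parquet"
```

Only this short string crosses step boundaries through the orchestrator's metadata store; the gigabytes behind it stay in object storage, which is what keeps the control plane lightweight and parallel steps independent.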
1. Netflix training pipeline: An Airflow DAG validates that the previous 24 hours of user interaction data arrived (freshness check), triggers a Spark job to compute user and item embeddings across 200M interactions, launches distributed TensorFlow training on Kubernetes with 8 GPU nodes, logs metrics to an internal registry, and on Area Under the Curve (AUC) > 0.82 promotes the model to staging
2. Uber demand forecasting: The orchestrator checks that ride completion events for the target date exist in the data warehouse, runs feature computation across 50M trips with 200 features per city, trains a gradient boosting model per metropolitan area in parallel, validates the Mean Absolute Percentage Error (MAPE) < 15% threshold, and registers models with metadata linking to the exact data snapshot and code commit for audit compliance