
Training Orchestration: Coordinating the ML Pipeline as a DAG

Training orchestration manages the complete machine learning training lifecycle as a directed acyclic graph (DAG) of tasks. Each node in the graph represents a distinct operation such as data preparation, feature computation, model training, evaluation, or model registration. The orchestrator schedules these tasks in dependency order, handles retries when steps fail, propagates metadata between stages, and isolates failures so you can resume from the last successful checkpoint rather than restarting everything.

The core design principle is separating concerns into two planes. The control plane handles orchestration logic: scheduling decisions, retry policies, timeout enforcement, and metadata tracking. The data plane executes the compute-intensive work: reading gigabytes of training data, computing features across millions of rows, running gradient descent for hours on graphics processing units (GPUs). This separation lets you scale compute independently of orchestration and swap out backends without rewriting pipeline logic.

In production, the flow typically looks like this: the orchestrator triggers a training run with specific parameters on a compute backend such as Kubernetes, the training job logs metrics and parameters to an experiment tracker like MLflow, artifacts are written to object storage under versioned paths, and on successful validation the orchestrator promotes the model in a registry and triggers downstream inference pipelines. TheFork implements exactly this pattern, with Airflow orchestrating daily DAGs, jobs running on batch compute, and MLflow handling all experiment tracking and model promotion through a central registry.
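A minimal sketch of that flow, assuming Airflow's TaskFlow API and the MLflow client, might look like the following. The compute-backend calls (`submit_training_job`, `wait_for_job`), the bucket path, the model name, and the AUC gate are illustrative placeholders, not details from TheFork's setup.

```python
# Rough sketch of the daily-DAG pattern described above: freshness check ->
# submit training to a compute backend -> validate and promote via MLflow.
from datetime import datetime

import mlflow
from airflow.decorators import dag, task
from mlflow.tracking import MlflowClient

AUC_THRESHOLD = 0.80  # illustrative promotion gate


def submit_training_job(data_uri: str) -> str:
    """Placeholder: launch a training job on the compute backend (e.g. Kubernetes)."""
    raise NotImplementedError


def wait_for_job(job_id: str) -> str:
    """Placeholder: poll the backend until the job finishes; return its MLflow run id."""
    raise NotImplementedError


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_training():

    @task
    def check_freshness(ds=None) -> str:
        # Control plane: confirm yesterday's partition landed before spending GPU hours.
        # In practice this would query the warehouse; here we just build the input path.
        return f"s3://training-data/interactions/dt={ds}"  # hypothetical versioned input

    @task
    def train(data_uri: str) -> str:
        # The orchestrator only submits and polls; the heavy lifting runs on the data plane.
        job_id = submit_training_job(data_uri)
        return wait_for_job(job_id)  # pass back a run id, not the model itself

    @task
    def validate_and_promote(run_id: str) -> None:
        client = MlflowClient()
        auc = client.get_run(run_id).data.metrics["val_auc"]
        if auc > AUC_THRESHOLD:
            version = mlflow.register_model(f"runs:/{run_id}/model", "ranking_model")
            client.transition_model_version_stage("ranking_model", version.version, "Staging")

    validate_and_promote(train(check_freshness()))


daily_training()
```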
💡 Key Takeaways
Control plane manages scheduling, retries, and metadata tracking, while the data plane handles compute-intensive operations like feature generation and model training on separate backends
Steps must be idempotent and deterministic, with explicit versioned inputs and outputs stored in durable storage, to enable safe retries and backfills without data corruption (see the sketch after this list)
Orchestrator triggers compute jobs asynchronously and polls for completion rather than running compute directly, isolating control-plane load from data-plane resource spikes
TheFork production pattern: Airflow orchestrates daily DAGs with data freshness checks, batch backend executes training, MLflow tracks experiments and manages model promotion through registry
Pass artifact references like object storage uniform resource identifiers (URIs) or database identifiers (IDs) between steps instead of large payloads to avoid memory bottlenecks and enable parallel execution
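To make the idempotency and reference-passing points concrete, here is a small runnable sketch; local file paths stand in for object storage, and the feature logic, directory layout, and version string are illustrative assumptions. Each step derives its output location deterministically from the run date and code version and returns only that reference, so a retry or backfill overwrites the same partition instead of duplicating data.

```python
# Minimal sketch of an idempotent pipeline step with versioned, reference-passed outputs.
import json
from pathlib import Path

CODE_VERSION = "a1b2c3d"  # e.g. the git commit baked into the pipeline image
ARTIFACT_ROOT = Path("/tmp/ml-artifacts")


def features_path(run_date: str) -> Path:
    # Deterministic: the same (run_date, CODE_VERSION) always maps to the same
    # location, so a retry or backfill overwrites instead of duplicating data.
    return ARTIFACT_ROOT / "features" / f"dt={run_date}" / f"code={CODE_VERSION}" / "features.json"


def compute_features(run_date: str) -> str:
    out = features_path(run_date)
    out.parent.mkdir(parents=True, exist_ok=True)
    features = {"run_date": run_date, "avg_sessions": 3.2}  # stand-in for real feature work
    out.write_text(json.dumps(features))  # durable, versioned output
    return str(out)                       # pass a reference, not the payload


def train_model(features_uri: str) -> None:
    features = json.loads(Path(features_uri).read_text())
    print("training with", features)      # downstream step reads from the reference


if __name__ == "__main__":
    train_model(compute_features("2024-06-01"))  # rerunning is safe: same path, same result
```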
📌 Examples
Netflix training pipeline: Airflow DAG validates that previous 24 hours of user interaction data arrived (freshness check), triggers Spark job to compute user and item embeddings across 200M interactions, launches distributed TensorFlow training on Kubernetes with 8 GPU nodes, logs metrics to internal registry, and on Area Under the Curve (AUC) > 0.82 promotes model to staging
Uber demand forecasting: Orchestrator checks that ride completion events for target date exist in data warehouse, runs feature computation across 50M trips with 200 features per city, trains gradient boosting model per metropolitan area in parallel, validates Mean Absolute Percentage Error (MAPE) < 15% threshold, registers models with metadata linking to exact data snapshot and code commit for audit compliance
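The audit-linkage and metric-gate pattern in the Uber example can be approximated with standard MLflow calls. The snippet below is a hedged sketch: the experiment name, tag keys, snapshot identifier, commit hash, and MAPE value are all illustrative, and the actual registration and stage transition would follow the same pattern as the DAG sketch earlier in this section.

```python
# Sketch: attach lineage metadata to an MLflow run and apply a metric gate.
import mlflow

MAPE_THRESHOLD = 0.15  # promote only if validation error clears this bar

mlflow.set_experiment("demand_forecasting")

with mlflow.start_run(run_name="sf_2024-06-01") as run:
    # Lineage: tie the run to the exact data snapshot and code commit for audits.
    mlflow.set_tags({
        "data_snapshot": "warehouse.trips/dt=2024-06-01",  # hypothetical snapshot id
        "git_commit": "a1b2c3d",
        "city": "san_francisco",
    })
    mlflow.log_param("n_features", 200)

    mape = 0.12  # stand-in for the validation result computed by the training job
    mlflow.log_metric("val_mape", mape)

    # Gate: flag the run as promotable; registration would be triggered by the orchestrator.
    mlflow.set_tag("promotable", str(mape < MAPE_THRESHOLD))
```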