Three Tools, Three Roles: Airflow, Kubeflow Pipelines, and MLflow
Three distinct tool categories are often conflated but serve different roles. Workflow orchestrators like Airflow, Kubeflow Pipelines, Prefect, and AWS Step Functions manage DAG execution: they schedule tasks based on dependencies and time triggers, handle retries with exponential backoff, track task state, and provide user interfaces for monitoring. Training backends like Kubernetes, AWS Batch, Apache Spark, or Ray actually execute the compute-intensive work: they allocate CPUs and GPUs, run containers or processes, handle distributed communication, and report job status back to the orchestrator. Experiment and model lifecycle managers like MLflow or internal registries track what happened during each run: parameters like learning rate and batch size, metrics like accuracy and loss curves, artifacts like trained model files, and governance state like which model version is approved for production.
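To make the experiment-tracker role concrete, here is a minimal sketch of training code logging parameters, per-epoch metrics, and a model artifact to MLflow. The tracking URI, experiment name, and all values are illustrative assumptions, not details from the source; the training loop is a stand-in.

```python
# Minimal sketch of the experiment-tracker role: log parameters, a metric curve,
# and a model artifact to MLflow. URI, names, and values are placeholders.
import pickle

import mlflow


def train_one_epoch(epoch: int) -> float:
    """Stand-in for a real training step; returns a fake decreasing loss."""
    return 1.0 / (epoch + 1)


mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical tracking server
mlflow.set_experiment("ranking-model")                  # hypothetical experiment name

with mlflow.start_run(run_name="daily-retrain"):
    # Parameters: the knobs that defined this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 512)

    # Metrics: logged every epoch so the tracker records the full loss curve
    for epoch in range(10):
        loss = train_one_epoch(epoch)
        mlflow.log_metric("loss", loss, step=epoch)

    # Artifact: the trained model file, stored alongside the run metadata
    with open("model.pkl", "wb") as f:
        pickle.dump({"weights": "placeholder"}, f)
    mlflow.log_artifact("model.pkl")
```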
The production pattern is integration across all three. The orchestrator parameterizes and triggers a training run on a compute backend by submitting a Kubernetes Job specification or making a Batch API call. The training code running on that backend logs parameters and metrics to the experiment tracker throughout execution and writes model artifacts to object storage with versioned paths. On successful completion, the orchestrator receives a success signal, validates that the output artifacts exist, and calls the model registry API to promote the model from candidate to production status, which triggers downstream inference pipeline updates.
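A hedged sketch of the orchestrator side in Airflow 2.x: one task stands in for submitting the training job to the compute backend, and a follow-up task promotes the model in the MLflow registry once a tracked metric clears a threshold. The model name, metric key, threshold, and schedule are assumptions, and `transition_model_version_stage` is the classic stage-based registry API (newer MLflow releases favor model aliases).

```python
# Sketch of the orchestrator side: an Airflow DAG triggers training on a compute
# backend, then promotes the resulting model in the MLflow registry.
# Job submission is stubbed; model name, metric, threshold, and schedule are
# illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from mlflow.tracking import MlflowClient

MODEL_NAME = "ranking-model"  # hypothetical registered model name


def submit_training_job(**context):
    # In production this would submit a Kubernetes Job spec or a Batch API call
    # and poll for completion; stubbed here to keep the sketch self-contained.
    print("submitting training job to compute backend")


def promote_if_good(**context):
    client = MlflowClient()
    # Classic (pre-alias) registry API: fetch the newest unstaged version
    candidate = client.get_latest_versions(MODEL_NAME, stages=["None"])[0]
    run = client.get_run(candidate.run_id)
    if run.data.metrics.get("ndcg", 0.0) > 0.78:  # threshold is an assumption
        client.transition_model_version_stage(
            name=MODEL_NAME, version=candidate.version, stage="Production"
        )


with DAG(
    dag_id="daily_retrain",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # daily at 2am; 'schedule_interval' on Airflow < 2.4
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train", python_callable=submit_training_job)
    promote = PythonOperator(task_id="promote", python_callable=promote_if_good)
    train >> promote
```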
TheFork implements exactly this split: Airflow orchestrates the workflow and enforces data quality gates, cloud batch services provide the compute backend, and MLflow handles all experiment tracking and model registry functions. This separation of concerns means you can swap Airflow for Kubeflow Pipelines without changing your MLflow tracking code, or migrate from one cloud batch service to another without rewriting orchestration logic.
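To illustrate why that swap is cheap, a small sketch of a backend-agnostic training entrypoint: its only coupling is to the tracker, configured through environment variables that the orchestrator injects, so the same container runs under Airflow plus a batch service or under Kubeflow Pipelines without modification. Variable names other than `MLFLOW_TRACKING_URI` are illustrative assumptions.

```python
# Sketch of a backend-agnostic training entrypoint: the only coupling is to the
# experiment tracker, configured via environment variables injected by whichever
# orchestrator launched the container. Names and values are placeholders.
import os

import mlflow

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])        # injected by the orchestrator
mlflow.set_experiment(os.environ.get("EXPERIMENT_NAME", "default"))  # hypothetical variable

with mlflow.start_run():
    mlflow.log_param("learning_rate", float(os.environ.get("LEARNING_RATE", "0.01")))
    # ... training loop and per-epoch metric logging as usual ...
    mlflow.log_metric("ndcg", 0.80)  # placeholder value
```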
💡 Key Takeaways
•Workflow orchestrators handle when and how to run tasks with scheduling and retry logic; training backends execute the actual compute with CPU and GPU allocation; experiment trackers record what happened with parameters and metrics
•Airflow excels at heterogeneous data workflows with thousands of scheduled DAGs, rich operator ecosystem, and mature backfill support for reprocessing historical date ranges
•Kubeflow Pipelines integrates tightly with Kubernetes for GPU scheduling, distributed training, and autoscaling, but adds platform complexity and longer feedback cycles because container image builds take approximately 10 minutes per iteration, as observed at Exness
•MLflow provides universal experiment tracking and model registry capabilities that complement any orchestrator choice, storing run metadata, artifacts, and promotion state independently
•Production pattern at scale: orchestrator validates data freshness and triggers backend job, training code logs to tracker during execution, orchestrator promotes model in registry on success and notifies inference systems
📌 Examples
LinkedIn feed ranking: an Airflow DAG scheduled daily at 2am checks the previous day's engagement data for completeness in Hadoop and triggers a distributed training job on a Kubernetes cluster with 16 GPU nodes; the training code logs precision at k and normalized discounted cumulative gain (NDCG) metrics to MLflow every epoch, and when NDCG > 0.78 Airflow calls the registry API to promote the model and trigger a serving pipeline rebuild
Airbnb pricing model: a Prefect orchestrator monitors S3 for a new booking data partition and launches an AWS Batch job with 64 CPU cores for feature computation across 5M listings; the training container logs feature importance and mean absolute error (MAE) to MLflow with a link to the data snapshot, and the model is promoted to the production registry when MAE < $12 and model size < 200MB for mobile deployment