Training Orchestration Failure Modes in Production
Backfill and Historical Rerun Failures
Backfills and historical reruns form the first major failure class. Kubeflow lacks native backfill support, forcing teams to write manual orchestration loops to reprocess historical date ranges. Without idempotent writes and clear partitioning, concurrent backfill runs can double-count training data or overwrite artifacts from other runs. When historical runs lack quota controls, they overwhelm clusters and starve daily production training jobs of resources. Organizations reprocessing 90 days of transaction data for model retraining have saturated their Kubernetes clusters, missing daily training SLAs for hours until they added per-namespace quotas and concurrency limits.
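The idempotency and concurrency controls above can be sketched in a few lines. This is a minimal illustration, not any orchestrator's API: the partition layout, the write-then-rename pattern, and the concurrency cap are all assumptions.

```python
import os
import tempfile
from datetime import date, timedelta
from threading import Semaphore

MAX_CONCURRENT = 4            # assumed per-namespace concurrency cap
_slots = Semaphore(MAX_CONCURRENT)

def partition_path(root: str, day: date) -> str:
    # One output partition per day; a rerun targets the same path.
    return os.path.join(root, f"dt={day.isoformat()}")

def backfill_day(root: str, day: date, rows: list) -> str:
    """Write a day's partition atomically so reruns overwrite, never double-count."""
    with _slots:  # bound how many backfill writes run at once
        out = partition_path(root, day)
        os.makedirs(root, exist_ok=True)
        # Write to a temp file, then rename: a rerun replaces the whole
        # partition instead of appending duplicate rows.
        fd, tmp = tempfile.mkstemp(dir=root)
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(rows))
        os.replace(tmp, out)
        return out

def backfill_range(root: str, start: date, days: int, load):
    # Manual backfill loop over a historical date range.
    for i in range(days):
        day = start + timedelta(days=i)
        backfill_day(root, day, load(day))
```

Because each run replaces its whole partition rather than appending, rerunning a day is safe; the semaphore keeps a 90-day backfill from consuming every free slot at once.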
Environment Drift
Environment drift destroys reproducibility. Orchestrators with a shared runtime suffer "works yesterday, fails today" breakage when someone upgrades a shared library. Container-based systems are not immune: base images that auto-update without digest pinning introduce subtle behavior changes. One large recommendation system saw model quality drop 3 percent after an OpenCV patch changed image preprocessing behavior, and the issue took days to debug because container tags used "latest" instead of pinned digests. Missing dataset versioning in experiment tracking means audits and incident postmortems cannot exactly reproduce past runs, because the underlying training data has since changed.
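One mitigation is to fingerprint every run's inputs at launch time. The following is a hedged sketch, not a specific tracking tool's API: the manifest file, the `record_run_manifest` helper, and the digest-pinning check are hypothetical names for illustration.

```python
import hashlib
import json

def file_digest(path: str) -> str:
    # Content hash of a dataset file, so a past run's exact inputs are auditable.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def record_run_manifest(dataset_paths, image_ref, out_path="run_manifest.json"):
    """Pin dataset contents by digest and refuse mutable image tags."""
    if "@sha256:" not in image_ref:
        # Reject mutable tags like ':latest'; require a pinned digest.
        raise ValueError(f"image not pinned by digest: {image_ref}")
    manifest = {
        "image": image_ref,
        "datasets": {p: file_digest(p) for p in dataset_paths},
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Recording the manifest alongside experiment metadata means a postmortem can compare today's digests against the original run's and immediately see whether the image or the data drifted.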
Resource Contention and Scheduling Failures
Resource contention and scheduling failures manifest differently by backend. GPU fragmentation occurs when requested GPU shapes do not match free capacity on any single node: with 8-GPU nodes, jobs requesting 2 GPUs pack 4 per node, but a job requesting 4 GPUs queues when no single node has 4 free, even though idle GPUs sit scattered across the cluster. Distributed training stragglers caused by preemptions or slow nodes can stretch training time by 2 to 5 times when checkpointing and resume logic are not configured.
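The checkpoint-and-resume logic that limits straggler and preemption damage can be sketched minimally. This is an illustrative pattern under assumptions, not a framework API: the JSON state file, checkpoint interval, and function names are all hypothetical.

```python
import json
import os

def save_checkpoint(step: int, state: dict, path: str) -> None:
    # Write-then-rename so a preemption mid-write never leaves a corrupt file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps: int, path: str) -> int:
    """Resume from the last checkpoint; return steps executed this run."""
    start, state = load_checkpoint(path)
    for step in range(start, total_steps):
        state["loss"] = 1.0 / (step + 1)   # stand-in for a real training step
        if step % 10 == 0:                  # periodic checkpoint
            save_checkpoint(step + 1, state, path)
    save_checkpoint(total_steps, state, path)
    return total_steps - start
```

After a preemption, a restarted worker repeats only the steps since the last checkpoint instead of the whole run, which is what keeps the 2-5x blowup bounded.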
Data Freshness Failures
Data freshness failures occur when orchestrators lack reliable sensors: training proceeds on partial partitions, downstream model metrics degrade silently, and the issue surfaces only when users notice that recommendation quality has dropped. Fail-closed data quality gates, which block training unless inputs are verified complete, are essential.
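A fail-closed gate of this kind can be expressed simply. The sketch below assumes a dict of partition row counts stands in for whatever catalog or metastore the orchestrator actually queries; the exception name and threshold are hypothetical.

```python
from datetime import date, timedelta

class StaleDataError(RuntimeError):
    """Raised when training inputs are missing or incomplete."""

def check_partitions(available: dict, start: date, days: int, min_rows: int = 1):
    """Fail closed: raise unless every expected daily partition exists
    and meets a minimum row count; training runs only if this returns."""
    for i in range(days):
        day = (start + timedelta(days=i)).isoformat()
        rows = available.get(day, 0)
        if rows < min_rows:
            raise StaleDataError(f"partition {day} has {rows} rows, need >= {min_rows}")
    return True
```

The key design choice is the default direction: an absent or undersized partition raises and blocks the run, rather than letting training proceed on partial data and surface days later as degraded recommendations.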