Failure Modes in GPU Orchestration: Fragmentation, Deadlock, and Health Drift
GPU orchestration systems face unique failure modes that can silently degrade performance, waste expensive resources, or create deadlocks that starve workloads.
Resource fragmentation occurs when available capacity cannot satisfy new requests despite appearing sufficient. Pre-carving each A100 into seven 1g.5gb slices optimizes for small models, but if most incoming jobs need 10 GB instances, hundreds of 5 GB slices sit idle: the cluster shows 300 free slices yet cannot admit a single job. Dynamic slice reconfiguration during low-load windows and admission control that aligns request shapes with the current pool mitigate this, at the cost of added operational complexity.
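As a rough illustration of that admission-control idea, the sketch below compares an incoming job's requested MIG profile against the free slice pool and rejects requests the current carving cannot satisfy. The `SlicePool` structure and profile bookkeeping are hypothetical, not any scheduler's real API.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical free-slice inventory, keyed by MIG profile name.
@dataclass
class SlicePool:
    free: Counter = field(default_factory=Counter)  # e.g. {"1g.5gb": 300}

    def can_admit(self, profile: str, count: int) -> bool:
        """A job is admissible only if enough slices of the *exact*
        requested profile are free; 1g.5gb slices cannot be combined
        into a 2g.10gb instance without reconfiguring the GPU."""
        return self.free[profile] >= count

    def admit(self, profile: str, count: int) -> bool:
        if not self.can_admit(profile, count):
            return False
        self.free[profile] -= count
        return True


pool = SlicePool(free=Counter({"1g.5gb": 300}))
# 300 slices are free, yet a single 2g.10gb request is rejected:
# the cluster looks underutilized while jobs queue (fragmentation).
print(pool.admit("2g.10gb", 1))   # False
print(pool.admit("1g.5gb", 4))    # True
```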
Gang scheduling deadlock happens when a large job blocks many smaller ones. A 512-GPU job tries to assemble contiguous placement and reserves partial capacity; if unmanaged, 400 GPUs sit idle waiting for the final 112, blocking dozens of smaller jobs that could run immediately. The result is starvation and long-tail wait times. Time-bound reservations that expire after 10 to 15 minutes, plus backfilling algorithms that pack small jobs around large reservations, keep utilization above 70 percent.
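A minimal sketch of those two mitigations, assuming a hypothetical scheduler loop rather than any particular framework's API: partial reservations expire after a TTL, and small jobs are backfilled onto reserved GPUs only if they are expected to finish before the gang job can start.

```python
import time
from dataclasses import dataclass

RESERVATION_TTL_S = 15 * 60  # expire partial reservations after 15 minutes


@dataclass
class Reservation:
    job_id: str
    gpus_needed: int
    gpus_held: int
    created_at: float

    def expired(self, now: float) -> bool:
        # Still waiting on the rest of its gang past the TTL?
        return (self.gpus_held < self.gpus_needed
                and now - self.created_at > RESERVATION_TTL_S)


def release_stale_reservations(reservations, free_gpus, now=None):
    """Return held GPUs to the free pool when a gang job could not
    assemble full placement within the TTL, instead of starving small jobs."""
    now = time.time() if now is None else now
    still_active = []
    for r in reservations:
        if r.expired(now):
            free_gpus += r.gpus_held
        else:
            still_active.append(r)
    return still_active, free_gpus


def backfill(small_jobs, reservation, now, est_assembly_done):
    """Greedily run small jobs on GPUs a reservation is holding, but only
    if each is expected to finish before the gang job's estimated start."""
    started, spare = [], reservation.gpus_held
    for job_id, gpus, est_runtime_s in small_jobs:
        if gpus <= spare and now + est_runtime_s <= est_assembly_done:
            spare -= gpus
            started.append(job_id)
    return started
```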
Device health drift is particularly insidious. ECC memory errors, thermal throttling, or driver bugs can degrade performance by 20 to 40 percent without crashing anything. Without device-level health taints, the scheduler keeps placing work on flaky GPUs, creating mysterious slowdowns. A health controller must quarantine devices when error thresholds trip (for example, more than 5 ECC errors per hour or sustained temperature above 85 degrees Celsius) and trigger remediation. Metric blindness under MIG compounds this: if monitoring reports aggregate GPU utilization instead of per-slice metrics, autoscaling acts on misleading data, leading to over- or under-provisioning.
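A health controller along these lines might look like the sketch below. The thresholds come from the text; the metric samples and the `taint_device`/`evict_pods` callables are placeholders, not NVML or Kubernetes client calls.

```python
from dataclasses import dataclass

ECC_ERRORS_PER_HOUR_MAX = 5     # thresholds from the text above
TEMP_CELSIUS_MAX = 85.0


@dataclass
class GpuHealthSample:
    device_id: str
    ecc_errors_last_hour: int
    temp_celsius: float


def should_quarantine(sample: GpuHealthSample) -> bool:
    """Flag devices whose error rate or sustained temperature indicates
    silent degradation rather than an outright failure."""
    return (sample.ecc_errors_last_hour > ECC_ERRORS_PER_HOUR_MAX
            or sample.temp_celsius > TEMP_CELSIUS_MAX)


def reconcile(samples, taint_device, evict_pods):
    """One pass of a hypothetical health-controller loop: taint unhealthy
    devices so the scheduler stops placing work on them, then evict and
    reschedule whatever is already running there."""
    for s in samples:
        if should_quarantine(s):
            taint_device(s.device_id, reason="gpu-health-drift")
            evict_pods(s.device_id)


# Example: one flaky GPU (12 ECC errors/hour, 87 C) gets quarantined.
samples = [GpuHealthSample("gpu-0", 12, 87.0), GpuHealthSample("gpu-1", 0, 64.0)]
reconcile(samples,
          taint_device=lambda d, reason: print(f"taint {d}: {reason}"),
          evict_pods=lambda d: print(f"evict pods on {d}"))
```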
💡 Key Takeaways
• Fragmentation wastes capacity when slice sizes misalign with workload demand. A cluster with 400 free 5 GB MIG slices cannot admit jobs needing 10 GB instances. Dynamic reconfiguration during off-peak hours and admission control that shapes requests prevent this.
• Gang scheduling deadlock blocks many small jobs while waiting to assemble capacity for one large job. Time-bound reservations that expire after 10 to 15 minutes and backfilling algorithms that pack small work around reservations keep utilization above 70 percent.
• Device health drift silently degrades performance by 20 to 40 percent without triggering failures. ECC errors, thermal throttling above 85 degrees Celsius, or driver bugs must trigger device taints and quarantine so the scheduler stops assigning new work to affected GPUs.
• Time-slicing latency spikes create 2x to 5x p99 increases when one tenant saturates streaming multiprocessors or memory bandwidth. Strict per-tenant concurrency limits and workload classification (real-time on dedicated slices, batch on shared) prevent this.
• Topology-unaware placement reduces throughput by 30 to 60 percent when distributed training spans nodes unnecessarily. A 4-GPU job split across two nodes communicates over 200 Gbps InfiniBand instead of 600 GB/s NVLink within one node.
• Cold starts without caching take 160 seconds to load 20 GB models from remote storage at 1 Gbps. Node-local NVMe caches and warm pools reduce this to under 20 seconds, preventing cascading SLO violations during autoscaling events (a minimal cache-check sketch follows this list).
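As a rough sketch of the caching point in the last bullet, a serving node can check a node-local NVMe cache before pulling model weights over the network, turning a roughly 160-second remote pull into a local read. The cache path and the `fetch_remote` helper are hypothetical placeholders for whatever object-store download the cluster actually uses.

```python
from pathlib import Path

CACHE_DIR = Path("/mnt/nvme/model-cache")   # hypothetical node-local NVMe mount


def load_model_weights(model_name: str, fetch_remote) -> Path:
    """Return a local path to the model weights, copying from remote
    storage only on a cache miss."""
    cached = CACHE_DIR / model_name
    if cached.exists():
        return cached                                 # warm start: local NVMe read
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_name(cached.name + ".partial")
    fetch_remote(model_name, tmp)                     # cold start: ~160 s for 20 GB at 1 Gbps
    tmp.rename(cached)                                # atomic publish into the cache
    return cached
```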
📌 Examples
A production cluster with 1,000 MIG slices in the 1g.5gb profile receives jobs needing 2g.10gb instances. Despite 600 free slices, jobs queue because the scheduler cannot combine slices. After reconfiguring 50 percent of GPUs to 2g.10gb during nightly maintenance, job admission improves by 80 percent.
A 512-GPU training job waits 45 minutes for full placement while holding 480 GPUs reserved. During this window, 60 smaller jobs that could have run on those 480 GPUs are blocked. Implementing 15-minute reservation expiry and backfilling increases cluster utilization from 55 percent to 78 percent.
A health controller detects a GPU with 12 ECC errors in one hour and a sustained temperature of 87 degrees Celsius. It taints the node, evicts running pods, and triggers a hardware check. Before this was implemented, flaky GPUs caused 25 percent of training jobs to experience unexplained 30 percent slowdowns.