
Failure Modes in GPU Orchestration: Fragmentation, Deadlock, and Health Drift

GPU Orchestration Failures: Unlike CPU clusters, where failures are usually obvious (a crash or an OOM kill), GPU orchestration fails in subtle ways: fragmentation blocks scheduling, deadlocks stall queues, and unhealthy GPUs silently produce slow or wrong results. These failures manifest as degraded throughput and mysterious job failures.

Resource Fragmentation

A cluster has 32 free GPUs, yet a job requesting 8 GPUs cannot schedule. Why? The 32 free GPUs are scattered: 4 nodes with 2 GPUs available each, 3 nodes with 4 GPUs each, and so on. No single node or topology-connected group has 8 available GPUs. Fragmentation worsens over time as long-running jobs occupy random positions. Mitigations: defragmentation (preempt low-priority jobs to consolidate free space), bin-packing schedulers (fill nodes before spreading), and job sizing guidance (discourage awkward GPU counts that fragment easily).
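The scenario above can be sketched in a few lines. This is a toy model, not a real scheduler: the free-GPU counts are hypothetical, and `best_fit` illustrates the bin-packing idea of placing jobs where they leave the least leftover capacity.

```python
def schedulable(free_per_node, requested):
    """A single-node job needing `requested` GPUs schedules only if
    some node has that many free, regardless of the cluster-wide total."""
    return any(free >= requested for free in free_per_node)

def best_fit(free_per_node, requested):
    """Bin-packing heuristic: pick the node with the smallest leftover,
    keeping large contiguous blocks free for future big jobs."""
    candidates = [(free - requested, i)
                  for i, free in enumerate(free_per_node) if free >= requested]
    return min(candidates)[1] if candidates else None

# 32 GPUs free in total, scattered across 11 nodes (hypothetical layout)
free = [2, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]
print(sum(free))               # 32 free GPUs...
print(schedulable(free, 8))    # False: ...but no node has 8 free
print(best_fit(free, 3))       # 7: a node with exactly 3 free, zero leftover
```

A spread-style scheduler would instead pick the emptiest node, which erodes large free blocks and makes the 8-GPU case more likely over time.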

Scheduling Deadlocks

Job A requests 8 GPUs, gets 4, and waits for 4 more. Job B requests 8 GPUs, gets 4, and waits for 4 more. Neither can proceed; each holds resources the other needs. Without gang scheduling (all-or-nothing allocation), such deadlocks are common in multi-tenant clusters. Detection: monitor job wait times and alert when jobs wait longer than a threshold. Resolution: preempt one job to free resources, implement gang scheduling, or use priority-based preemption where lower-priority jobs release resources.
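A minimal sketch of the gang-scheduling fix, with a hypothetical scheduler class: GPUs are committed only when the full request fits, so a job either gets everything or holds nothing, and the A/B deadlock above cannot form.

```python
class GangScheduler:
    """Toy all-or-nothing allocator over per-node free GPU counts."""

    def __init__(self, free_per_node):
        self.free = list(free_per_node)

    def try_schedule(self, requested):
        plan, need = [], requested
        for i, free in enumerate(self.free):
            if need == 0:
                break
            take = min(free, need)
            if take:
                plan.append((i, take))
                need -= take
        if need > 0:
            return False            # cannot satisfy fully: grant nothing
        for i, take in plan:        # commit only on full success
            self.free[i] -= take
        return True

sched = GangScheduler([4, 4])       # 8 GPUs total across 2 nodes
print(sched.try_schedule(8))        # True: job A gets all 8 atomically
print(sched.try_schedule(8))        # False: job B waits, holding zero GPUs
```

Contrast with partial allocation: there, both jobs would each hold 4 GPUs forever. Kubernetes users get this behavior from gang-aware schedulers such as Volcano rather than the default scheduler.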

GPU Health Drift

GPUs degrade silently. Memory errors accumulate (ECC correctable errors), thermal throttling reduces performance (overheating GPUs clock down), and driver issues cause intermittent failures. A "healthy" GPU that is actually throttled completes jobs slowly, dragging down distributed training (all workers wait for the slowest). Monitoring must track ECC error counts, GPU temperature and clock speeds, and per-GPU job completion times. Automatic remediation: drain unhealthy GPUs from the scheduling pool and alert the infrastructure team.
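The drain decision can be sketched as a simple policy over per-GPU metrics. The thresholds, field names, and fleet data below are illustrative assumptions; in practice the metrics would come from DCGM or `nvidia-smi` queries rather than hard-coded dicts.

```python
ECC_ERROR_LIMIT = 100    # hypothetical correctable-error budget per GPU
THROTTLE_RATIO = 0.9     # flag GPUs running below 90% of base SM clock

def should_drain(gpu):
    """Return True if a GPU should be removed from the scheduling pool."""
    if gpu["ecc_corrected"] > ECC_ERROR_LIMIT:
        return True                      # accumulating memory errors
    if gpu["sm_clock_mhz"] < THROTTLE_RATIO * gpu["base_clock_mhz"]:
        return True                      # thermally throttled: silently slow
    return False

fleet = [  # hypothetical metrics snapshot
    {"id": 0, "ecc_corrected": 3,   "sm_clock_mhz": 1410, "base_clock_mhz": 1410},
    {"id": 1, "ecc_corrected": 512, "sm_clock_mhz": 1410, "base_clock_mhz": 1410},
    {"id": 2, "ecc_corrected": 0,   "sm_clock_mhz": 900,  "base_clock_mhz": 1410},
]
drained = [g["id"] for g in fleet if should_drain(g)]
print(drained)   # [1, 2]: one GPU with ECC drift, one throttled
```

GPU 2 is the dangerous case from the text: it reports healthy but runs at ~64% of base clock, so every distributed-training step waits on it.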

Monitoring Priority: Track fragmentation ratio (requested vs schedulable), job queue wait times, and per-GPU health metrics. These leading indicators predict failures before users notice.
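One simple formulation of the fragmentation ratio (an assumption, since the text does not pin down a formula): compare total free GPUs against the largest single-node block that could actually schedule. Near 0 means free capacity is consolidated; near 1 means it is badly scattered.

```python
def fragmentation_ratio(free_per_node):
    """1 - (largest single-node free block / total free GPUs).
    0.0 = fully consolidated free capacity, approaching 1.0 = scattered."""
    total_free = sum(free_per_node)
    if total_free == 0:
        return 0.0
    return 1 - max(free_per_node) / total_free

# Scattered cluster from the fragmentation section: 32 free, biggest block 4
print(fragmentation_ratio([2, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]))  # 0.875
# Same 8 free GPUs consolidated on one node
print(fragmentation_ratio([8, 0, 0, 0]))                        # 0.0
```

Alerting when this ratio trends upward, alongside queue wait times, surfaces fragmentation before an 8-GPU job visibly fails to place.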

💡 Key Takeaways
Fragmentation: 32 GPUs available but 8-GPU job cannot schedule due to scatter
Deadlock: two jobs each holding half resources waiting for the other half
GPU health drift: thermal throttling and ECC errors degrade performance silently
📌 Interview Tips
1. Bin-packing schedulers fill nodes before spreading to reduce fragmentation
2. Drain unhealthy GPUs from the pool when ECC errors or throttling are detected