Failure Modes: Fragmentation, Thrashing, and Topology Misplacement
Device Fragmentation
GPU scheduling introduces failure modes invisible in CPU-only systems. Device fragmentation occurs when MIG slices or whole-GPU allocations strand capacity: two free slices sit on GPU A and two on GPU B, but each queued job needs three slices from a single GPU. Aggregate capacity exists but is unusable. Memory fragmentation within a single GPU is equally insidious: long-lived model weights interleaved with frequent allocation and deallocation of activations leave unusable holes in device memory.
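The stranded-capacity arithmetic can be sketched in a few lines. This is an illustrative placement check, not a real scheduler API; the helper name and slice counts are assumptions chosen to mirror the example above.

```python
# Illustrative sketch: aggregate free MIG slices can exceed a request
# that still cannot be placed, because a job needs all its slices
# on a single GPU. Not a real scheduler API.

def placeable(free_slices_per_gpu, slices_needed):
    """A job is placeable only if some single GPU has enough free slices."""
    return any(free >= slices_needed for free in free_slices_per_gpu)

free = [2, 2]                   # GPU A and GPU B each have 2 free slices
print(sum(free))                # 4 slices free in aggregate
print(placeable(free, 3))       # False: no single GPU can host a 3-slice job
```

The same check explains why defragmentation (draining and repacking slices onto fewer GPUs) recovers capacity that the aggregate counter says was never missing.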
Scheduler Induced Idleness
Scheduler-induced idleness and thrashing waste even more resources. Default framework execution leaves GPUs idle 70 to 91 percent of the time due to kernel-launch overhead. Preemption thrashing occurs when a job is killed, queued, restarted, and killed again before it can checkpoint, discarding all progress each cycle. Gang scheduling amplifies this: preempting one worker idles the other 7 in an 8-GPU distributed job, wasting 87.5 percent of the allocated capacity.
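The gang-scheduling amplification is simple arithmetic, sketched below. The function is illustrative; the 8-worker, 1-preemption case matches the figure in the text.

```python
# Sketch of the gang-scheduling waste arithmetic: preempting one worker
# of a tightly synchronized job leaves the remaining workers allocated
# but idle, since the collective cannot make progress without it.

def gang_waste(total_workers, preempted):
    """Fraction of the gang's allocated GPUs left idle but still held."""
    idle = total_workers - preempted
    return idle / total_workers

print(gang_waste(8, 1))   # 0.875 -> 87.5% of allocated capacity wasted
```

The waste fraction approaches 1 as gang size grows, which is why large distributed jobs are usually preempted whole or not at all.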
Topology Misplacement
Topology misplacement degrades throughput by multiples. Placing an all-reduce training job across PCIe links instead of NVLink drops bandwidth from approximately 600 GB/s to approximately 32 GB/s per direction, roughly a 20x difference. NCCL ring reconfiguration during migration adds seconds of instability and can trigger cascading timeouts. Heterogeneity drift is subtler: nominally identical GPU SKUs exhibit different performance due to thermal throttling, ECC memory errors, or silicon-lottery variance.
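Why one bad link dominates can be sketched as a bottleneck calculation. The bandwidth figures come from the text; the link map and function are illustrative assumptions, not a real topology API.

```python
# Hedged sketch of the topology bottleneck: a ring all-reduce runs at
# the speed of its slowest inter-GPU link, so a single PCIe hop in an
# otherwise NVLink-connected ring caps the whole collective.

LINK_BW_GBPS = {"nvlink": 600, "pcie": 32}   # per-direction, from the text

def allreduce_bw(links):
    """Effective per-direction bandwidth of a ring over these links."""
    return min(LINK_BW_GBPS[link] for link in links)

full_nvlink = allreduce_bw(["nvlink"] * 7)            # 600 GB/s
one_pcie    = allreduce_bw(["nvlink"] * 6 + ["pcie"]) # 32 GB/s
print(full_nvlink / one_pcie)                         # ~19x slowdown
```

This is why topology-aware schedulers score placements by the worst link in the candidate ring rather than by average bandwidth.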
Interference and Oversubscription
Oversubscription and interference from naive co-location violate SLOs. MPS or pure time-slicing can produce cache and memory-bandwidth contention, causing latency spikes. Co-locating a memory-bound job with a compute-bound one sounds optimal, but workload phase changes create bursty interference. MIG reconfiguration requires evicting all tenants, with no live resize, causing operational churn.
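One mitigation is an interference-aware admission check. The sketch below is a hypothetical rule, not an MPS or MIG API: the profile fields, the normalized bandwidth budget, and the thresholds are assumptions for illustration. The key design point from the text is to budget against peak, not average, demand, because phase changes make interference bursty.

```python
# Illustrative co-location admission check: reject pairings whose
# combined *peak* memory-bandwidth demand exceeds the device budget,
# since average-based packing ignores bursty phase overlap.
# Fields and thresholds are hypothetical.

def can_colocate(job_a, job_b, bw_budget=1.0):
    """Admit only if peak combined bandwidth demand fits the budget."""
    return job_a["peak_mem_bw"] + job_b["peak_mem_bw"] <= bw_budget

compute_bound = {"peak_mem_bw": 0.3}   # fractions of device bandwidth
memory_bound  = {"peak_mem_bw": 0.9}

print(can_colocate(compute_bound, memory_bound))   # False: bursts overlap
print(can_colocate(compute_bound, compute_bound))  # True: headroom remains
```

Averaging the two profiles would admit the first pairing; budgeting by peaks rejects it, trading some utilization for SLO safety.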