
Failure Modes: Fragmentation, Thrashing, and Topology Misplacement

Graphics Processing Unit (GPU) scheduling introduces failure modes that are invisible in systems that schedule only Central Processing Units (CPUs). Device fragmentation occurs when Multi-Instance GPU (MIG) slices or whole-GPU allocations strand capacity: you have 2 free slices on GPU A and 2 on GPU B, but each queued job needs 3 slices from a single GPU. Aggregate capacity exists but is unusable. Memory fragmentation within a single GPU is equally insidious: long-lived model weights interleaved with frequent allocation and deallocation of activations and gradients create unusable holes. Unlike CPU memory, GPU memory compaction is expensive or unavailable, so fragmentation persists until the job terminates.

Scheduler-induced idleness and thrashing waste even more resources. Default framework execution leaves GPUs idle 70 to 91% of the time due to launch overhead, as discussed earlier. Preemption thrashing occurs when a job is killed, queued, restarted, and killed again before it can checkpoint, discarding all progress. Gang scheduling amplifies this: preempting one worker idles the other 7 in an 8-GPU distributed job, wasting 87.5% of the allocated capacity. Without promotion policies, best-effort (BE) jobs can be starved indefinitely by a continuous stream of latency-sensitive (LS) arrivals.

Topology misplacement degrades throughput by multiples. Placing an all-reduce training job across PCIe links instead of NVLink drops bandwidth from approximately 600 GB/s to approximately 32 GB/s per direction, roughly a 20x difference. NVIDIA Collective Communication Library (NCCL) ring reconfiguration during migration adds seconds of instability and can trigger cascading timeouts in tightly synchronized collectives.

Heterogeneity drift is subtle but persistent: GPUs with identical Stock Keeping Units (SKUs) exhibit different performance due to thermal throttling (hotter rack positions run 5 to 10% slower), Error-Correcting Code (ECC) memory errors, or silicon-lottery variance. Static placement assumptions therefore degrade over time without continuous profiling.

Oversubscription and interference from naive co-location violate Service Level Objectives (SLOs). Multi-Process Service (MPS) or pure time-slicing can produce cache and memory-bandwidth contention, causing latency spikes. Co-locating memory-bound and compute-bound jobs sounds optimal, but workload phase changes (data loading followed by compute) create bursty interference. MIG reconfiguration requires evicting all tenants; there is no live resize, so every capacity rebalance causes operational churn.
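To make the device-fragmentation case concrete, here is a minimal Python sketch of a hypothetical first-fit allocator (the MigGpu and schedule names are illustrative, not a real scheduler API). It ignores MIG placement-geometry rules and only shows how aggregate free capacity can be unschedulable because a job's slices must come from a single device.

```python
# Minimal sketch (hypothetical allocator): per-GPU allocation boundaries
# strand aggregate MIG capacity even when total free slices look sufficient.
from typing import Optional


class MigGpu:
    """Tracks free slices on one GPU; placement geometry is ignored."""
    def __init__(self, name: str, total_slices: int = 7):
        self.name = name
        self.free = total_slices

    def try_allocate(self, slices: int) -> bool:
        if slices <= self.free:
            self.free -= slices
            return True
        return False


def schedule(job_slices: int, gpus: list[MigGpu]) -> Optional[str]:
    """A job's slices must all come from one GPU; spanning devices is not possible."""
    for gpu in gpus:
        if gpu.try_allocate(job_slices):
            return gpu.name
    return None  # stranded: capacity exists in aggregate, but not on any single GPU


gpu_a, gpu_b = MigGpu("GPU-A"), MigGpu("GPU-B")
schedule(5, [gpu_a, gpu_b])            # leaves 2 free slices on GPU-A
schedule(5, [gpu_a, gpu_b])            # leaves 2 free slices on GPU-B
placed = schedule(3, [gpu_a, gpu_b])   # needs 3 slices from a single GPU
print(f"total free slices: {gpu_a.free + gpu_b.free}, 3-slice job placed on: {placed}")
# -> total free slices: 4, 3-slice job placed on: None
```

A real scheduler would also have to respect MIG's fixed placement geometry and profile sizes, which makes stranding even more likely than this simplified model suggests.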
💡 Key Takeaways
Device fragmentation: 2 free MIG slices on GPU A plus 2 on GPU B cannot satisfy a job needing 3 slices from one GPU; aggregate capacity exists but is stranded by allocation boundaries
Topology misplacement degrades all-reduce training throughput by 2 to 10x when jobs span PCIe (approximately 32 GB/s) instead of NVLink (approximately 600 GB/s); NCCL ring reconfiguration on migration adds seconds of instability (worked arithmetic in the sketch after this list)
Preemption thrashing wastes progress when jobs are killed before checkpointing and restarted repeatedly; gang-scheduled 8-GPU jobs amplify waste by idling 7 workers when 1 is preempted
Heterogeneity drift from thermal throttling, ECC errors, and silicon variance causes identical GPU SKUs to perform 5 to 10% differently; static placement degrades over time without continuous health monitoring
MIG reconfiguration requires evicting all tenants with no live resize, causing operational churn; MPS and naive co-location risk cache and bandwidth contention producing latency spikes that violate SLOs
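The bandwidth gap in the topology takeaway above translates into per-step communication time. The sketch below is a hedged back-of-the-envelope calculation assuming an idealized ring all-reduce (each rank transfers roughly 2(N-1)/N times the gradient payload) and the link speeds quoted above; the payload size and the function name ring_allreduce_seconds are assumptions for illustration.

```python
# Back-of-the-envelope sketch: communication time for one idealized ring
# all-reduce at the NVLink vs. PCIe bandwidths quoted above. Real NCCL
# performance also depends on topology, chunking, and compute overlap.
def ring_allreduce_seconds(payload_gb: float, n_ranks: int, link_gb_per_s: float) -> float:
    """Estimate seconds to all-reduce `payload_gb` across `n_ranks` ranks."""
    volume_gb = 2 * (n_ranks - 1) / n_ranks * payload_gb  # data each rank moves
    return volume_gb / link_gb_per_s


payload_gb = 10.0   # e.g. fp32 gradients of a ~2.5B-parameter model (assumption)
ranks = 8
for name, bw in [("NVLink ~600 GB/s", 600.0), ("PCIe ~32 GB/s", 32.0)]:
    t = ring_allreduce_seconds(payload_gb, ranks, bw)
    print(f"{name}: ~{t * 1000:.0f} ms per all-reduce step")
# -> NVLink ~600 GB/s: ~29 ms per all-reduce step
# -> PCIe ~32 GB/s:   ~547 ms per all-reduce step
```

The raw link ratio is close to 20x; end-to-end training throughput typically degrades by the smaller 2 to 10x factor cited above because communication partially overlaps with compute.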
📌 Examples
Production fragmentation scenario: a 16×A100 cluster with MIG configured as 7 slices per GPU yields 112 total slices; after running 10×3-slice jobs and 5×2-slice jobs (40 slices in use), the remaining 72 slices can be scattered across GPUs in a pattern that leaves no valid placement for a 4-slice job, despite ample aggregate capacity
Gandiva research: migration typically completes in under 4 seconds, but NCCL ring reconfiguration can add several more seconds and trigger timeouts in tightly synchronized collectives, requiring retry logic
Memory fragmentation observed in long-running training: PyTorch eager mode allocates and frees activations per layer, creating 100 to 500 MB unusable holes in 80 GB of VRAM, eventually triggering out-of-memory errors despite 20 GB nominally free
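One way to observe the kind of allocator behavior described in the last example is through PyTorch's caching-allocator counters. The sketch below is illustrative only: the helper name report_fragmentation is an assumption, the reserved-minus-allocated gap is a symptom rather than a precise fragmentation metric, and the mitigation knobs in the comments are options to experiment with, not fixed recommendations.

```python
# Sketch: flag when the PyTorch caching allocator holds much more memory than
# live tensors need, a common symptom of fragmentation in long-running training.
import torch


def report_fragmentation(device: int = 0) -> float:
    """Return the fraction of reserved GPU memory not backing live tensors."""
    allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    if reserved == 0:
        return 0.0
    slack = (reserved - allocated) / reserved
    print(f"allocated={allocated / 2**30:.1f} GiB  reserved={reserved / 2**30:.1f} GiB  "
          f"slack={slack:.1%}")
    return slack


# A persistently high slack fraction right before an OOM suggests fragmentation
# rather than genuine exhaustion. Knobs to experiment with (behavior varies by
# PyTorch version):
#   - torch.cuda.empty_cache() at phase boundaries to return unused cached blocks
#   - the PYTORCH_CUDA_ALLOC_CONF environment variable, e.g.
#     "expandable_segments:True" or "max_split_size_mb:128", set before startup
if torch.cuda.is_available():
    report_fragmentation()
```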