
Failure Modes: Fragmentation, Thrashing, and Topology Misplacement

Device Fragmentation

GPU scheduling introduces failure modes invisible in CPU-only systems. Device fragmentation occurs when MIG slices or whole-GPU allocations strand capacity: you have 2 free slices on GPU A and 2 on GPU B, but queued jobs each need 3 slices from a single GPU. Aggregate capacity exists but is unusable. Memory fragmentation within a single GPU is equally insidious: long-lived model weights interleaved with frequently allocated and freed activations create unusable holes in VRAM.
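The single-GPU constraint can be sketched in a few lines. This is an illustrative model, not a real scheduler API: aggregate free capacity is irrelevant if no one device can host the whole request.

```python
# Illustrative model (hypothetical, not a real scheduler API):
# MIG slices free per GPU; a job's slices must come from one GPU.
free_slices = {"gpu_a": 2, "gpu_b": 2}

def can_place(job_slices: int, free: dict) -> bool:
    # Check each device individually, not the cluster aggregate.
    return any(avail >= job_slices for avail in free.values())

print(sum(free_slices.values()))   # 4 slices free cluster-wide
print(can_place(3, free_slices))   # False: no single GPU has 3 free
```

The same check generalizes to memory fragmentation inside one GPU: the question is never "how much is free?" but "is any single contiguous region large enough?"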

Scheduler Induced Idleness

Scheduler-induced idleness and thrashing waste even more resources. Default framework execution leaves GPUs idle 70 to 91 percent of the time due to kernel-launch overhead. Preemption thrashing occurs when a job is killed, queued, restarted, and killed again before it can checkpoint, discarding all progress. Gang scheduling amplifies this: preempting one worker idles the other 7 in an 8-GPU distributed job, wasting 87.5 percent of allocated capacity.
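The gang-scheduling amplification is simple arithmetic; a quick sketch of the 8-GPU case from above:

```python
# Back-of-envelope: preempting 1 worker of a gang-scheduled job
# idles the rest, since collectives block until all ranks rejoin.
gang_size = 8
preempted = 1
idle_workers = gang_size - preempted
wasted_fraction = idle_workers / gang_size
print(f"{wasted_fraction:.1%}")    # 87.5% of allocated GPUs sit idle
```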

Topology Misplacement

Topology misplacement degrades throughput by multiples. Placing an all-reduce training job across PCIe links instead of NVLink drops bandwidth from approximately 600 GB/s to approximately 32 GB/s per direction, roughly a 20x difference. NCCL ring reconfiguration during migration adds seconds of instability and can trigger cascading timeouts. Heterogeneity drift is subtle: identical GPU SKUs exhibit different performance due to thermal throttling, ECC memory errors, or silicon-lottery variance.
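A rough per-step estimate shows why link placement dominates. The sketch below assumes the standard ring all-reduce traffic model (each rank moves about 2(n-1)/n bytes per gradient byte) and the bandwidth figures above; the gradient size is a hypothetical example, and the model ignores latency, overlap, and protocol overhead.

```python
# Rough ring all-reduce time estimate (assumed traffic model only).
def allreduce_seconds(grad_bytes: float, n_gpus: int, link_gb_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # bytes per rank
    return traffic / (link_gb_s * 1e9)

grad = 10e9  # hypothetical: 10 GB of gradients per step
nvlink = allreduce_seconds(grad, 8, 600)  # NVLink-class bandwidth
pcie = allreduce_seconds(grad, 8, 32)     # PCIe-class bandwidth
print(f"{nvlink:.3f}s vs {pcie:.3f}s ({pcie / nvlink:.1f}x slower)")
```

Because the traffic term is identical in both cases, the slowdown collapses to the bandwidth ratio, which is why the 600 vs 32 GB/s gap translates almost directly into step time.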

Interference and Oversubscription

Oversubscription and interference from naive co-location violate SLOs. MPS or pure time-slicing can produce cache and memory-bandwidth contention, causing latency spikes. Co-locating memory-bound and compute-bound jobs sounds optimal, but workload phase changes create bursty interference. MIG reconfiguration requires evicting all tenants, with no live resize, causing operational churn.

💡 Key Takeaways
Device fragmentation: 2 free MIG slices on GPU A plus 2 on GPU B cannot satisfy a job needing 3 slices from one GPU; aggregate capacity exists but is stranded by allocation boundaries
Topology misplacement degrades all-reduce training throughput by 2 to 10x when jobs span PCIe (approximately 32 GB/s) instead of NVLink (approximately 600 GB/s); NCCL ring reconfiguration on migration adds seconds of instability
Preemption thrashing wastes progress when jobs are killed before checkpointing and restarted repeatedly; gang-scheduled 8-GPU jobs amplify waste by idling 7 workers when 1 is preempted
Heterogeneity drift from thermal throttling, ECC errors, and silicon variance causes identical GPU SKUs to perform 5 to 10% differently; static placement degrades over time without continuous health monitoring
MIG reconfiguration requires evicting all tenants with no live resize, causing operational churn; MPS and naive co-location risk cache and bandwidth contention, producing latency spikes that violate SLOs
📌 Interview Tips
1. Production fragmentation scenario: a 16×A100 cluster with MIG configured as 7 slices per GPU yields 112 total slices; if each GPU is running one 4-slice job, 48 slices remain free cluster-wide yet no queued 4-slice job can be placed, because no single GPU has more than 3 free slices
2. Gandiva research: migration typically completes in under 4 seconds, but NCCL ring reconfiguration can add additional seconds and trigger timeouts in tightly synchronized collectives, requiring retry logic
3. Memory fragmentation observed in long-running training: PyTorch eager mode allocates and frees activations per layer, creating 100MB to 500MB unusable holes in 80GB VRAM, eventually triggering out-of-memory despite 20GB nominally free
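A toy first-fit packing model makes the stranding arithmetic concrete (hypothetical numbers; real MIG profile geometry is more restrictive than raw free-slice counts): fill each of 16 GPUs with one 4-slice job, and 48 slices sit free with no home for the next 4-slice request.

```python
# Toy first-fit MIG packing model (hypothetical; ignores real
# MIG profile placement rules).
free = [7] * 16                      # 16 GPUs x 7 MIG slices = 112

def place(job_slices: int) -> bool:
    # First-fit: a job's slices must be co-resident on one GPU.
    for i, avail in enumerate(free):
        if avail >= job_slices:
            free[i] -= job_slices
            return True
    return False

for _ in range(16):
    assert place(4)                  # one 4-slice job per GPU
print(sum(free))                     # 48 slices free cluster-wide
print(place(4))                      # False: every GPU has only 3 free
```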