
Topology-Aware Scheduling and Gang Scheduling for Distributed Training

Distributed training jobs spanning multiple GPUs require topology-aware placement and gang scheduling to avoid severe performance degradation and resource deadlocks. Topology awareness means understanding GPU interconnects and placing workloads to maximize bandwidth and minimize latency. Within a single node, NVLink provides 300 to 900 GB/s of bidirectional bandwidth between GPUs, with all-reduce latency under 1 millisecond for 32 MB buckets. Across nodes, InfiniBand or RoCE networks provide 200 to 400 Gbps with 10 to 20 microseconds of latency. Placing a 4-GPU job across two nodes instead of one forces cross-node communication and reduces throughput by 30 to 60 percent. Similarly, placing GPUs on opposite sides of a PCIe switch fabric within the same node introduces a 10 to 20 percent penalty compared to an NVLink-connected group.

Gang scheduling ensures all GPUs for a distributed job are allocated simultaneously. A training job requesting 512 GPUs cannot start until the scheduler finds 512 available devices with the required topology. Without gang scheduling, the system might allocate 400 GPUs and wait indefinitely for the remaining 112, blocking those 400 from other work. Gang scheduling prevents partial allocations and resource deadlocks, but it introduces its own challenges: a large job waiting for full placement can block many smaller jobs if not managed carefully. Time-bound reservations with backfilling allow small jobs to run while capacity is assembled for large ones, keeping overall cluster utilization above 70 to 80 percent.

At Meta's scale of approximately 24,000 H100 GPUs, topology-aware placement becomes mandatory. Jobs must respect rack-level power limits (40 to 60 kW per rack), optimize for the spine-leaf network topology, and coordinate with storage placement to avoid hotspots. Even a 1 percent improvement in GPU utilization at this scale translates to hundreds of thousands of dollars per month in effective capacity.
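The core of gang scheduling is an all-or-nothing admission decision. The sketch below illustrates that decision over a simplified in-memory view of free GPUs per node; the Node and GangJob structures and the greedy fill order are assumptions for illustration, standing in for the far richer state a production gang scheduler (for example, Volcano or Kueue on Kubernetes) actually tracks.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    free_gpus: int          # GPUs not assigned to any job
    rack: str               # used later for topology scoring

@dataclass
class GangJob:
    name: str
    gpus_required: int      # minimum cardinality: all-or-nothing

def try_place_gang(job: GangJob, nodes: list[Node]) -> Optional[dict[str, int]]:
    """Return a {node: gpu_count} placement only if the *entire* gang fits.

    Prefers nodes with the most free GPUs so the job spans as few nodes
    (and as few cross-node links) as possible. Returns None on failure,
    in which case the scheduler holds the job rather than start a partial
    allocation that would strand GPUs.
    """
    placement: dict[str, int] = {}
    remaining = job.gpus_required
    # Greedy: fill the emptiest nodes first to minimize the node count.
    for node in sorted(nodes, key=lambda n: n.free_gpus, reverse=True):
        if remaining == 0:
            break
        take = min(node.free_gpus, remaining)
        if take > 0:
            placement[node.name] = take
            remaining -= take
    return placement if remaining == 0 else None

# Example: a 16-GPU gang on a cluster with only 12 free GPUs is rejected
# outright instead of grabbing 12 GPUs and waiting for 4 more.
cluster = [Node("node-a", 8, "rack-1"), Node("node-b", 4, "rack-1")]
print(try_place_gang(GangJob("train-llm", 16), cluster))    # None
print(try_place_gang(GangJob("train-small", 12), cluster))  # {'node-a': 8, 'node-b': 4}
```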
💡 Key Takeaways
NVLink provides 300 to 900 GB/s of bidirectional bandwidth within a node with sub-millisecond latency, while cross-node InfiniBand at 200 to 400 Gbps adds 10 to 20 microseconds. Placing distributed training across topology boundaries creates 30 to 60 percent throughput penalties.
Gang scheduling admits a job only when all requested GPUs are available. A 512-GPU job with minimum cardinality of 512 blocks until the scheduler can place the full set, avoiding partial allocations that waste resources and cause deadlocks.
Backfilling allows small jobs to run while large jobs wait for full capacity. Without backfilling, a cluster can sit at 40 percent utilization while waiting to assemble 512 contiguous GPUs. With backfilling, utilization stays above 70 to 80 percent.
Scheduler scoring functions prioritize placements that maximize intra-node NVLink edges, minimize traffic across PCIe switches, and pack GPUs to reduce fragmentation. A typical scoring function might add 100 points for same node, 50 points for same rack, and 10 points for same availability zone, as in the first sketch after these takeaways.
Time-bound reservations prevent large jobs from holding partial capacity indefinitely. A reservation might expire after 10 minutes if full placement cannot be achieved, releasing resources for other work and retrying later; the second sketch after these takeaways shows the idea.
At scales of 10,000+ GPUs, placement must coordinate with power budgets (40 to 60 kW per rack), the spine-leaf network topology, and storage proximity to avoid creating hotspots or saturating shared resources.
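A minimal sketch of the additive scoring described above, using the illustrative 100/50/10 weights for same node, same rack, and same availability zone. The GpuSlot representation of a candidate placement is an assumption for illustration, not any particular scheduler's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GpuSlot:
    node: str
    rack: str
    zone: str

def topology_score(slots: list[GpuSlot]) -> int:
    """Score a candidate placement by summing pairwise locality bonuses.

    Illustrative weights from the takeaway above: 100 points when two GPUs
    share a node (NVLink reachable), 50 when they only share a rack, 10 when
    they only share an availability zone. Higher is better; the scheduler
    picks the feasible placement with the highest score.
    """
    score = 0
    for i in range(len(slots)):
        for j in range(i + 1, len(slots)):
            a, b = slots[i], slots[j]
            if a.node == b.node:
                score += 100
            elif a.rack == b.rack:
                score += 50
            elif a.zone == b.zone:
                score += 10
    return score

# Four GPUs on one node beat a two-and-two split across racks.
same_node = [GpuSlot("n1", "r1", "z1")] * 4
split     = [GpuSlot("n1", "r1", "z1")] * 2 + [GpuSlot("n9", "r7", "z1")] * 2
print(topology_score(same_node), topology_score(split))  # 600 vs. 240
```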
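And a sketch of time-bound reservations with backfilling: the large gang accumulates freed GPUs under a reservation that expires if full placement is not reached in time, while a backfill check admits small jobs that fit in unreserved GPUs and are expected to finish inside the reservation window. The class and function names are hypothetical, and the 10-minute timeout is the illustrative figure from the takeaway above.

```python
import time

RESERVATION_TIMEOUT_S = 10 * 60   # illustrative: give up after 10 minutes

class Reservation:
    """Holds GPUs for a large gang job while the rest of the set is assembled."""

    def __init__(self, job_name: str, gpus_needed: int):
        self.job_name = job_name
        self.gpus_needed = gpus_needed
        self.gpus_held = 0
        self.created_at = time.monotonic()

    def add_freed_gpus(self, count: int) -> bool:
        """Claim newly freed GPUs; return True once the full gang is held."""
        self.gpus_held = min(self.gpus_needed, self.gpus_held + count)
        return self.gpus_held == self.gpus_needed

    def expired(self) -> bool:
        """If full placement took too long, release everything and retry later."""
        return time.monotonic() - self.created_at > RESERVATION_TIMEOUT_S

def can_backfill(small_job_gpus: int, small_job_runtime_s: float,
                 free_unreserved_gpus: int, reservation: Reservation) -> bool:
    """Admit a small job only if it fits in GPUs the reservation is not holding
    and should finish before the reservation window closes, so it never delays
    the large job's start."""
    time_left = RESERVATION_TIMEOUT_S - (time.monotonic() - reservation.created_at)
    return (small_job_gpus <= free_unreserved_gpus
            and small_job_runtime_s <= time_left)

# Example: while a 512-GPU reservation assembles, an 8-GPU job is admitted
# only if 8 unreserved GPUs are idle and its runtime fits the remaining window.
```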
📌 Examples
A language model training job requests 512 A100 GPUs with gang scheduling. The scheduler finds 64 nodes with 8 GPUs each, all within the same InfiniBand fabric domain. All-reduce operations complete in 15 to 25 milliseconds for 128 MB gradients, achieving 90 percent scaling efficiency compared to the single-node baseline.
Without topology awareness, a 4-GPU job is split across two nodes. Cross-node all-reduce adds 18 milliseconds per iteration compared to 0.8 milliseconds with NVLink. Over 100,000 training iterations, this adds roughly 30 minutes of wall-clock time and reduces GPU utilization from 92 percent to 65 percent (see the worked arithmetic after these examples).
Meta's training infrastructure uses gang scheduling for large language model runs. A job needing 2,048 H100 GPUs waits up to 2 hours for full placement during peak demand. Backfilling allows hundreds of smaller jobs to complete during this window, keeping cluster utilization above 75 percent.
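A quick check of the wall-clock arithmetic in the second example, using only the figures stated there (the utilization numbers are quoted from the example, not derived here):

```python
# Extra all-reduce cost per iteration when the 4-GPU job spans two nodes.
cross_node_ms, nvlink_ms, iterations = 18.0, 0.8, 100_000
extra_minutes = (cross_node_ms - nvlink_ms) * iterations / 1000 / 60
print(f"{extra_minutes:.0f} extra minutes of wall-clock time")  # ~29 minutes
```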