Topology-Aware Gang Scheduling for Distributed Training
Distributed Machine Learning (ML) training requires gang scheduling: a job makes progress only when its entire worker set (all tensor-parallel, model-parallel, and data-parallel ranks) is co-allocated simultaneously. A single missing Graphics Processing Unit (GPU) stalls the entire collective communication operation, wasting all other allocated resources. The challenge is placement: where to locate these workers to minimize communication latency and maximize bandwidth for synchronous operations like all-reduce.
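To make the all-or-nothing constraint concrete, here is a minimal Python sketch that admits a job only when every rank can be placed at once; otherwise nothing is reserved and the job stays queued. The `Job` class, `try_gang_schedule` function, and node names are hypothetical illustrations, not the API of any particular scheduler.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int  # total ranks (product of tensor/model/data parallel degrees)

def try_gang_schedule(job: Job, free_gpus_per_node: dict[str, int]) -> dict[str, int] | None:
    """All-or-nothing allocation: either every rank gets a GPU, or nothing is reserved.

    Returns a {node: gpu_count} placement, or None if the gang cannot be
    co-allocated right now (the job keeps waiting rather than running partially).
    """
    placement: dict[str, int] = {}
    remaining = job.gpus_needed
    # Greedily fill the nodes with the most free GPUs first; a real scheduler
    # would also weigh topology (see the placement sketch later in this section).
    for node, free in sorted(free_gpus_per_node.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        placement[node] = take
        remaining -= take
    return placement if remaining == 0 else None

# A job needing 16 ranks is not admitted until all 16 GPUs are available at once.
print(try_gang_schedule(Job("llm-pretrain", 16), {"node0": 8, "node1": 4}))  # None -> waits
print(try_gang_schedule(Job("llm-pretrain", 16), {"node0": 8, "node1": 8}))  # full placement
```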
GPU interconnect topology creates a hierarchy of bandwidth tiers. NVLink within a single node provides aggregate bandwidth of approximately 600 GB/s per A100 GPU. PCIe Gen4 x16 drops to approximately 32 GB/s per direction. Cross-socket communication within a node traverses the CPU interconnect (Infinity Fabric or Ultra Path Interconnect), adding latency. Cross-node communication uses Remote Direct Memory Access (RDMA) over Ethernet or InfiniBand, with bandwidth and latency determined by the network fabric design. Placing an all-reduce ring across PCIe instead of NVLink can degrade throughput by 2 to 10x for communication-bound workloads.
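A rough worked example helps quantify that gap. The sketch below estimates ring all-reduce time from the nominal bandwidths quoted above; the 14 GB gradient size and the formula's neglect of latency and compute overlap are simplifying assumptions.

```python
def ring_allreduce_seconds(msg_gb: float, n_workers: int, bw_gb_s: float) -> float:
    """Rough ring all-reduce estimate: each rank transfers about
    2*(N-1)/N times the message size; ignores latency and overlap."""
    return 2 * (n_workers - 1) / n_workers * msg_gb / bw_gb_s

grads_gb = 14.0    # e.g. gradients for ~7B fp16 parameters (assumed for illustration)
n = 8
nvlink_bw = 600.0  # GB/s aggregate per A100 over NVLink (figure from the text)
pcie_bw = 32.0     # GB/s per direction, PCIe Gen4 x16 (figure from the text)

print(f"NVLink island: {ring_allreduce_seconds(grads_gb, n, nvlink_bw) * 1e3:.1f} ms")
print(f"PCIe only:     {ring_allreduce_seconds(grads_gb, n, pcie_bw) * 1e3:.1f} ms")
# The ~19x nominal bandwidth gap is where the 2 to 10x end-to-end slowdown for
# communication-bound steps comes from once compute overlap is factored in.
```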
Topology-aware schedulers like HiveD model the hierarchy from GPU to PCIe switch to CPU socket to node and assign affinity constraints. For synchronous data-parallel training with all-reduce, keep all workers within a single NVLink island (8 GPUs on modern NVIDIA servers). For larger jobs, pack workers to minimize cross-socket and cross-node hops. For parameter-server architectures that tolerate higher latency, place parameter shards on Network Interface Card (NIC)-rich nodes and ensure Non-Uniform Memory Access (NUMA) affinity between NIC and GPU via the PCIe switch mapping.
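The following sketch illustrates the general idea of topology-aware placement scoring, using a hypothetical node/socket/PCIe-switch map and hand-picked penalty weights; it is not HiveD's actual cell-allocation algorithm, which avoids this kind of brute-force enumeration.

```python
from itertools import combinations

# Hypothetical 4-node cluster, 8 GPUs per node: gpu id -> (node, cpu_socket, pcie_switch)
TOPOLOGY = {g: (g // 8, (g % 8) // 4, (g % 8) // 2) for g in range(32)}

def placement_cost(gpus) -> int:
    """Penalize crossings of slower interconnect tiers: cross-node hops cost
    the most, then cross-socket, then cross-PCIe-switch. Lower is better."""
    cost = 0
    for a, b in combinations(gpus, 2):
        na, sa, pa = TOPOLOGY[a]
        nb, sb, pb = TOPOLOGY[b]
        if na != nb:
            cost += 100   # RDMA across the network fabric
        elif sa != sb:
            cost += 10    # CPU interconnect (UPI / Infinity Fabric)
        elif pa != pb:
            cost += 1     # different PCIe switch under the same socket
    return cost

def best_placement(free_gpus, need):
    """Exhaustive search for illustration only; HiveD-style schedulers use
    hierarchical cell allocation rather than enumerating combinations."""
    return list(min(combinations(free_gpus, need), key=placement_cost))

free = [4, 5, 6, 7] + list(range(8, 16))  # half of node 0 free, all of node 1 free
print(best_placement(free, 8))            # -> GPUs 8..15: the intact NVLink island on node 1
```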
The tradeoff is utilization versus throughput. Strict topology constraints increase queue times and create fragmentation: a job needing 16 GPUs across 2 tightly coupled nodes may wait while 16 scattered GPUs sit idle. Relaxing constraints improves placement flexibility but risks degrading training throughput. Systems that allow elastic scaling (growing or shrinking the worker count at epoch boundaries) can defragment capacity dynamically, admitting new jobs without waiting for perfect topology alignment.
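As a sketch of that relaxed, elastic admission policy (the `admit_elastic` helper and the island sizes below are made up for illustration), a job can start on whatever intact NVLink islands are free, provided that meets its minimum world size, and grow toward its requested size at later epoch boundaries.

```python
def admit_elastic(requested: int, min_workers: int, free_island_sizes: list[int]) -> int:
    """Elastic admission under relaxed topology constraints: start with as many
    whole 8-GPU NVLink islands as are free right now, as long as that meets the
    job's minimum; 0 means keep queuing. Growth happens at epoch boundaries."""
    usable = sum(s for s in free_island_sizes if s == 8)  # only intact islands count
    granted = min(requested, usable)
    return granted if granted >= min_workers else 0

# Requested 32 data-parallel ranks, willing to start with 16:
print(admit_elastic(32, 16, [8, 8, 5, 3]))  # -> 16: start now, grow later
print(admit_elastic(32, 24, [8, 8, 5, 3]))  # -> 0: below the floor, keep waiting
```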
💡 Key Takeaways
•Gang scheduling requires the entire worker set (all tensor-parallel and data-parallel ranks) to be co-allocated; a single missing GPU stalls collective operations and wastes all other allocated resources
•NVLink provides approximately 600 GB/s aggregate bandwidth per A100 within a node; PCIe Gen4 x16 drops to approximately 32 GB/s per direction; misplacement across these tiers degrades throughput by 2 to 10x for communication-bound training
•Topology-aware schedulers model the GPU to PCIe switch to CPU socket hierarchy, keeping synchronous all-reduce jobs within NVLink islands to minimize latency (sub-1 ms vs 5 to 10 ms cross-node)
•Strict topology constraints improve training throughput but increase queue times and fragmentation: 16 scattered idle GPUs cannot run a job needing 2 tightly coupled 8-GPU nodes
•Elastic scaling (grow/shrink at epoch boundaries) and parameter-server architectures with asynchronous updates tolerate relaxed placement, trading per-iteration latency for higher utilization and shorter queue times
📌 Examples
Meta distributed training pattern: Data-parallel ranks placed within NVLink islands on Zion and Grand Teton servers, minimizing cross-socket and PCIe traffic for synchronous all-reduce in large Distributed Data Parallel (DDP) and Mixture of Experts (MoE) training jobs
HiveD scheduler: Models the GPU to PCIe switch to CPU socket hierarchy, assigns affinity constraints that keep 8-GPU training jobs within a single NVLink domain, and queues jobs until the topology requirement is met in order to preserve throughput
OpenAI gang allocation: Model-parallel and tensor-parallel groups co-located in the same rack or NVLink domain to minimize collective communication latency for transformer training at scale