Topology-Aware Gang Scheduling for Distributed Training
Why Gang Scheduling Matters
Distributed ML training requires gang scheduling: a job makes progress only when its entire worker set (all tensor-parallel, model-parallel, and data-parallel ranks) is co-allocated simultaneously. A single missing GPU stalls the collective communication operation, idling every other allocated resource. The challenge is placement: where to locate these workers so that synchronous operations like all-reduce see minimal latency and maximal bandwidth.
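The all-or-nothing property can be sketched as a minimal admission check. This is an illustrative toy, not any real scheduler's API; `Node`, `try_gang_schedule`, and the node names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpus: int

def try_gang_schedule(nodes, gpus_needed):
    """Return a {node: gpus} placement covering every rank, or None.

    All-or-nothing: a partial allocation is never committed, because a
    job with even one missing GPU cannot make progress.
    """
    placement = {}
    remaining = gpus_needed
    for node in sorted(nodes, key=lambda n: -n.free_gpus):
        if remaining == 0:
            break
        take = min(node.free_gpus, remaining)
        if take > 0:
            placement[node.name] = take
            remaining -= take
    return placement if remaining == 0 else None

nodes = [Node("n0", 8), Node("n1", 4), Node("n2", 2)]
print(try_gang_schedule(nodes, 12))  # fits: {'n0': 8, 'n1': 4}
print(try_gang_schedule(nodes, 16))  # does not fit: None
```

Real gang schedulers layer queueing, preemption, and retry on top of this check, but the core invariant is the same: commit all ranks or commit none.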
GPU Interconnect Hierarchy
GPU interconnect topology creates a hierarchy of bandwidth tiers. NVLink within a single node provides roughly 600 GB/s of aggregate bandwidth per A100 GPU. PCIe Gen4 x16 drops to roughly 32 GB/s per direction. Cross-socket communication within a node traverses the CPU interconnect, adding latency. Cross-node communication uses RDMA over Ethernet or InfiniBand. Placing an all-reduce ring across PCIe instead of NVLink can degrade throughput by 2-10x for communication-bound workloads.
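A back-of-envelope model shows why the link tier dominates. In a ring all-reduce each GPU sends and receives about 2(N-1)/N of the payload over its slowest link; the figures below are illustrative assumptions, not measurements:

```python
def ring_allreduce_seconds(payload_bytes, n_gpus, link_gb_per_s):
    """Bandwidth-only estimate of one ring all-reduce: each GPU moves
    2*(N-1)/N of the payload over its slowest link (latency ignored)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / (link_gb_per_s * 1e9)

payload = 2 * 10**9  # e.g. fp16 gradients of a ~1B-parameter model
# Per-direction effective bandwidths are assumptions for illustration.
for name, gbps in [("NVLink (~300 GB/s per direction)", 300),
                   ("PCIe Gen4 x16 (~32 GB/s)", 32)]:
    t = ring_allreduce_seconds(payload, 8, gbps)
    print(f"{name}: {t * 1e3:.1f} ms")
```

With these assumed numbers the PCIe ring is roughly 300/32, or about 9x, slower per all-reduce, consistent with the 2-10x degradation range above.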
Topology Aware Placement
Topology-aware schedulers like HiveD model the GPU-to-PCIe-switch-to-CPU-socket-to-node hierarchy and assign affinity constraints. For synchronous data-parallel training with all-reduce, keep all workers within a single NVLink island (8 GPUs on modern NVIDIA servers). For larger jobs, pack workers to minimize cross-socket and cross-node hops. For parameter-server architectures, which tolerate higher latency, place parameter shards on NIC-rich nodes and ensure NUMA affinity between NIC and GPU.
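One way to operationalize the hierarchy is a pairwise hop-cost score that a scheduler minimizes when choosing among candidate placements. This is a simplified sketch, not HiveD's actual cell model; the tier costs and tuple-based topology encoding are hypothetical, ordered by the hierarchy described above:

```python
# Illustrative relative costs: NVLink island < cross-socket < cross-node.
HOP_COST = {"nvlink": 1, "socket": 4, "node": 20}

def lowest_common_tier(gpu_a, gpu_b):
    """GPUs are encoded as (node, socket, local_index) tuples."""
    if gpu_a[0] != gpu_b[0]:
        return "node"
    if gpu_a[1] != gpu_b[1]:
        return "socket"
    return "nvlink"

def placement_cost(gpus):
    """Sum of pairwise hop costs; lower means a tighter gang."""
    return sum(HOP_COST[lowest_common_tier(a, b)]
               for i, a in enumerate(gpus)
               for b in gpus[i + 1:])

same_island = [("n0", 0, i) for i in range(4)]
split_nodes = [("n0", 0, 0), ("n0", 0, 1), ("n1", 0, 0), ("n1", 0, 1)]
print(placement_cost(same_island) < placement_cost(split_nodes))  # True
```

A scheduler evaluating candidate GPU sets can simply pick the one with the lowest score, which naturally packs gangs into the tightest available interconnect domain.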
Utilization vs Throughput Trade-off
The tradeoff is utilization versus throughput. Strict topology constraints increase queue times and create fragmentation: a job needing 16 GPUs across 2 tightly coupled nodes may wait while 16 scattered GPUs sit idle. Relaxing constraints improves placement flexibility but risks degrading training throughput. Elastic scaling systems can defragment capacity dynamically by migrating workers as jobs start and finish.
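The wait-versus-scatter decision reduces to comparing total completion times. All numbers below are assumptions for illustration, and `better_to_wait` is a hypothetical helper, not a real scheduler policy:

```python
def better_to_wait(train_hours_tight, slowdown_factor, expected_queue_hours):
    """Compare completion times: queue then run on a tight placement,
    versus starting now on scattered GPUs degraded by slowdown_factor."""
    tight_total = expected_queue_hours + train_hours_tight
    scattered_total = train_hours_tight * slowdown_factor
    return tight_total < scattered_total

# A 100-hour job facing a 3x scattered-placement slowdown is worth
# queueing for up to 200 hours before scattering wins.
print(better_to_wait(100, 3.0, 50))   # True: 150 h < 300 h
print(better_to_wait(100, 3.0, 250))  # False: 350 h > 300 h
```

In practice the queue time is an estimate and the slowdown depends on how communication-bound the job is, which is why elastic systems that can start scattered and later migrate onto a tight placement are attractive.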