
Topology-Aware Gang Scheduling for Distributed Training

Why Gang Scheduling Matters

Distributed ML training requires gang scheduling: a job makes progress only when its entire worker set (all tensor-parallel, model-parallel, and data-parallel ranks) is co-allocated simultaneously. A single missing GPU stalls the entire collective communication operation, wasting all other allocated resources. The challenge is placement: where to locate these workers to minimize communication latency and maximize bandwidth for synchronous operations like all-reduce.
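The all-or-nothing admission logic above can be sketched in a few lines. This is a minimal illustration, not any real scheduler's API; the function name and the node-to-free-GPU dictionary are assumptions for the example. The key property is atomic commit: capacity is deducted only once the full worker set fits, so a queued job never holds a partial reservation that would idle GPUs while collectives block on missing ranks.

```python
def gang_schedule(gpus_needed, free_gpus):
    """All-or-nothing gang admission (illustrative sketch).

    gpus_needed: total GPUs the job's worker set requires
    free_gpus:   dict mapping node name -> count of free GPUs (mutated on commit)
    Returns a {node: gpus_taken} placement, or None if the job must keep queueing.
    """
    placement = {}
    remaining = gpus_needed
    for node, free in free_gpus.items():
        take = min(free, remaining)
        if take:
            placement[node] = take
            remaining -= take
        if remaining == 0:
            # Commit atomically: deduct capacity only when every rank fits.
            for n, k in placement.items():
                free_gpus[n] -= k
            return placement
    return None  # insufficient capacity: queue the job, hold no partial GPUs
```

A real scheduler would also consider topology when choosing which nodes to draw from; that refinement is the subject of the placement section below.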

GPU Interconnect Hierarchy

GPU interconnect topology creates a hierarchy of bandwidth tiers. NVLink within a single node provides aggregate bandwidth of approximately 600 GB/s per A100 GPU. PCIe Gen4 x16 drops to approximately 32 GB/s per direction. Cross-socket communication within a node traverses the CPU interconnect, adding latency. Cross-node communication uses RDMA over Ethernet or InfiniBand. Placing an all-reduce ring across PCIe instead of NVLink can degrade throughput by 2 to 10x for communication-bound workloads.
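A back-of-the-envelope model makes the tier gap concrete. Ring all-reduce moves roughly 2(p-1)/p of the payload over the slowest link in the ring, so iteration time scales inversely with that link's bandwidth. The numbers below are assumptions for illustration (a 2 GB gradient payload and ~300 GB/s effective per-direction NVLink, versus the section's 32 GB/s PCIe figure); the resulting ratio lands inside the 2 to 10x degradation range cited above.

```python
def ring_allreduce_seconds(payload_bytes, num_ranks, link_bw_bytes_per_s):
    """Standard ring all-reduce cost model: each rank sends and receives
    2*(p-1)/p of the payload over the slowest link in the ring."""
    p = num_ranks
    return 2 * (p - 1) / p * payload_bytes / link_bw_bytes_per_s

grad = 2e9        # 2 GB of gradients per step (illustrative assumption)
nvlink = 300e9    # ~300 GB/s effective per direction on A100 NVLink (assumption)
pcie = 32e9       # ~32 GB/s per direction, PCIe Gen4 x16 (from the text)

t_nvlink = ring_allreduce_seconds(grad, 8, nvlink)
t_pcie = ring_allreduce_seconds(grad, 8, pcie)
# t_pcie / t_nvlink = 300/32 ≈ 9.4x slower when the ring crosses PCIe
```

The model ignores latency and protocol overhead, so it is most accurate for large, bandwidth-bound payloads, which is exactly the regime of gradient all-reduce in large-model training.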

Topology-Aware Placement

Topology-aware schedulers like HiveD model the GPU-to-PCIe-switch-to-CPU-socket-to-node hierarchy and assign affinity constraints. For synchronous data-parallel training with all-reduce, keep all workers within a single NVLink island (8 GPUs on modern NVIDIA servers). For larger jobs, pack workers to minimize cross-socket and cross-node hops. For parameter-server architectures that tolerate higher latency, place parameter shards on NIC-rich nodes and ensure NUMA affinity between NIC and GPU.
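One simple way to encode such a hierarchy is to give each GPU a path through it and cost a candidate placement by how many boundaries the worker set crosses, weighting node crossings above socket crossings above PCIe-switch crossings. The topology table and weights below are illustrative assumptions, not HiveD's actual data model, but the idea (prefer the candidate with the deepest common ancestor) is the same.

```python
# Hypothetical topology: each GPU id maps to its (node, socket, pcie_switch)
# path in the hardware hierarchy. Real schedulers discover this via tools
# like nvidia-smi topo or hwloc.
TOPOLOGY = {
    0: ("node0", "socket0", "pcie0"),
    1: ("node0", "socket0", "pcie0"),
    2: ("node0", "socket1", "pcie1"),
    3: ("node1", "socket0", "pcie2"),
}

def placement_cost(gpu_ids):
    """Weighted count of hierarchy boundaries the placement spans. Prefixes
    (not bare labels) are compared so 'socket0' on different nodes is not
    mistaken for the same socket."""
    paths = [TOPOLOGY[g] for g in gpu_ids]
    extra_nodes = len({p[:1] for p in paths}) - 1
    extra_sockets = len({p[:2] for p in paths}) - 1
    extra_switches = len({p[:3] for p in paths}) - 1
    # Illustrative weights: crossing a node is worst, then a socket, then a switch.
    return 100 * extra_nodes + 10 * extra_sockets + extra_switches

def best_placement(candidates):
    """Pick the most tightly coupled candidate worker set."""
    return min(candidates, key=placement_cost)
```

Under this cost, two GPUs under the same PCIe switch score 0, a cross-socket pair scores 11, and a cross-node pair scores 111, so the scheduler naturally keeps small all-reduce jobs inside one NVLink island.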

Utilization vs Throughput Trade-off

The trade-off is utilization versus throughput. Strict topology constraints increase queue times and create fragmentation: a job needing 16 GPUs across 2 tightly coupled nodes may wait while 16 scattered GPUs sit idle. Relaxing constraints improves placement flexibility but risks degrading training throughput. Elastic scaling systems can defragment capacity dynamically by growing or shrinking jobs at epoch boundaries.
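The fragmentation scenario above can be captured by a one-knob policy: cap how many nodes a job may span. The function and cluster layout below are assumptions for illustration; a strict cap rejects the fragmented cluster (job queues, GPUs idle), while removing the cap admits the job at the cost of spanning eight nodes' worth of cross-node all-reduce traffic.

```python
def can_place(gpus_needed, free_per_node, max_nodes=None):
    """Admission check under a topology policy (illustrative sketch).

    max_nodes caps how many nodes the job may span (strict = tightly
    coupled placement); max_nodes=None is the fully relaxed policy that
    accepts any free GPUs anywhere.
    """
    # Greedily pick the nodes with the most free GPUs.
    node_capacities = sorted(free_per_node.values(), reverse=True)
    if max_nodes is not None:
        node_capacities = node_capacities[:max_nodes]
    return sum(node_capacities) >= gpus_needed

fragmented = {f"n{i}": 2 for i in range(8)}  # 16 free GPUs, 2 per node
can_place(16, fragmented, max_nodes=2)  # strict: job queues despite 16 free GPUs
can_place(16, fragmented)               # relaxed: admitted, but spans 8 nodes
```

Elastic schedulers attack the same problem from the other side: by shrinking and regrowing jobs at epoch boundaries, they can migrate workers to consolidate free GPUs onto fewer nodes, so strict placements become feasible again without leaving capacity idle.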

💡 Key Takeaways
Gang scheduling requires the entire worker set (all tensor-parallel and data-parallel ranks) to be co-allocated; a single missing GPU stalls collective operations and wastes all other allocated resources
NVLink provides approximately 600 GB/s aggregate bandwidth per A100 within a node; PCIe Gen4 x16 drops to approximately 32 GB/s per direction; misplacement across these tiers degrades throughput by 2 to 10x for communication-bound training
Topology-aware schedulers model the GPU-to-PCIe-switch-to-CPU-socket hierarchy, keeping synchronous all-reduce jobs within NVLink islands to minimize latency (sub-1 ms intra-node vs 5 to 10 ms cross-node)
Strict topology constraints improve training throughput but increase queue times and fragmentation: 16 scattered idle GPUs cannot run a job needing 2 tightly coupled 8-GPU nodes
Elastic scaling (grow/shrink at epoch boundaries) and parameter-server architectures with asynchronous updates tolerate relaxed placement, trading per-iteration latency for higher utilization and shorter queue times
📌 Interview Tips
1. Meta distributed training pattern: data-parallel ranks placed within NVLink islands on Zion and Grand Teton servers, minimizing cross-socket and PCIe traffic for synchronous all-reduce in large Distributed Data Parallel (DDP) and Mixture of Experts (MoE) training jobs
2. HiveD scheduler: models the GPU-to-PCIe-to-socket hierarchy, assigns affinity constraints to keep 8-GPU training jobs within a single NVLink domain, and queues jobs until the topology requirement is met to preserve throughput
3. OpenAI gang allocation: model-parallel and tensor-parallel groups co-located in the same rack or NVLink domain to minimize collective communication latency for transformer training at scale