ML Infrastructure & MLOps · Resource Orchestration (Kubernetes, GPU Scheduling) · Medium · ⏱️ ~3 min

Topology Aware Scheduling and Gang Scheduling for Distributed Training

Topology-Aware Scheduling: Placing workloads on GPUs based on physical interconnect topology, not just availability. GPUs on the same node connected via NVLink communicate orders of magnitude faster than GPUs across nodes connected via network. Ignoring topology can slow distributed training by 2-5x.

Why Topology Matters

Data parallel training requires frequent gradient synchronization across GPUs. With 4 GPUs on the same node (NVLink, 600 GB/s), an all-reduce operation completing in 10ms might take 100ms across nodes (25 Gbps network). For a training step taking 200ms of compute, intra-node communication adds 5% overhead; inter-node adds 50%. The scheduler must understand: which GPUs share NVLink, which share PCIe switches, which require network hops. Placing a 4-GPU job across 4 nodes when 4 GPUs are available on one node is a performance disaster.
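The arithmetic above can be sketched with a simple bandwidth-only model of a ring all-reduce. The gradient size (4 GB, roughly a 1B-parameter fp32 model) and bandwidths are illustrative assumptions, and the model ignores latency and NIC aggregation, so the single-NIC inter-node estimate is pessimistic; real clusters close part of the gap with multiple NICs and hierarchical collectives, which is how an inter-node figure like the ~100 ms above becomes achievable.

```python
# Back-of-envelope cost of gradient all-reduce in data-parallel training.
# All numbers are illustrative assumptions, not measurements.

def allreduce_seconds(grad_bytes: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    """Ring all-reduce moves ~2*(n-1)/n of the gradient over the slowest link."""
    return (2 * (n_gpus - 1) / n_gpus) * grad_bytes / bw_bytes_per_s

GRAD_BYTES = 4e9        # 4 GB of gradients (~1B params in fp32)
NVLINK = 600e9          # 600 GB/s intra-node
NETWORK = 25e9 / 8      # single 25 Gbps NIC ~= 3.1 GB/s inter-node
COMPUTE = 0.200         # 200 ms of compute per training step

intra = allreduce_seconds(GRAD_BYTES, 4, NVLINK)
inter = allreduce_seconds(GRAD_BYTES, 4, NETWORK)

print(f"intra-node: {intra * 1e3:.1f} ms ({intra / COMPUTE:.0%} overhead)")
print(f"inter-node: {inter * 1e3:.1f} ms (bandwidth-bound over one NIC)")
```

The intra-node case lands at 10 ms, i.e. 5% overhead on a 200 ms step, matching the figures above; the single-NIC inter-node case is dominated entirely by network bandwidth.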

Gang Scheduling

Distributed training jobs need all their GPUs simultaneously. If a job requests 8 GPUs and only 6 are available, it cannot start—unlike CPU jobs that can run with partial allocation. Gang scheduling ensures all resources for a job are allocated atomically: either the job gets all 8 GPUs or it waits. Without gang scheduling, partial allocations cause deadlocks: Job A holds 4 GPUs waiting for 4 more, Job B holds 4 GPUs waiting for 4 more, neither can proceed.
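The all-or-nothing property can be shown in a few lines. This is a minimal sketch with a hypothetical scheduler class, not a real Kubernetes API: the key invariant is that a job that cannot get its full allocation holds nothing, so the circular-wait deadlock above cannot form.

```python
# Minimal sketch of gang (all-or-nothing) GPU allocation.
# The GangScheduler class is hypothetical, for illustration only.

class GangScheduler:
    def __init__(self, total_gpus: int):
        self.free = total_gpus

    def try_schedule(self, job: str, gpus_needed: int) -> bool:
        """Allocate atomically: the job gets all requested GPUs or none."""
        if gpus_needed <= self.free:
            self.free -= gpus_needed
            return True
        # Job waits while holding zero GPUs, so it cannot block others.
        return False

sched = GangScheduler(total_gpus=8)
assert sched.try_schedule("job-a", 8)      # gets all 8 at once
assert not sched.try_schedule("job-b", 8)  # waits, holding nothing
```

Without the atomicity check, job-a and job-b could each grab 4 GPUs and wait forever for the other 4.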

Scheduler Implementation

Standard Kubernetes schedulers are not topology-aware. They see GPUs as identical resources. ML schedulers extend this with: topology detection (discovering NVLink connections, PCIe topology, network layout), affinity rules (prefer co-located GPUs), and gang semantics (all-or-nothing allocation). The scheduler maintains a topology graph and solves a bin-packing problem: fit jobs onto the cluster while respecting topology preferences and resource constraints.
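A greedy version of that bin-packing step can be sketched as follows. The node/GPU model and `place` function are assumptions for illustration, not a real scheduler plugin: prefer a single node (one NVLink domain) via best-fit, and only spill across nodes when no node can hold the whole job, keeping gang semantics either way.

```python
# Hypothetical topology-preferring placement: one NVLink domain if possible,
# fewest nodes otherwise, all-or-nothing in every case.
from typing import Optional

def place(job_gpus: int, free_per_node: dict[str, int]) -> Optional[list[str]]:
    # Best case: one node holds the whole job (no network hops).
    fitting = [n for n, free in free_per_node.items() if free >= job_gpus]
    if fitting:
        # Best-fit: the tightest node, leaving big free blocks for later jobs.
        return [min(fitting, key=lambda n: free_per_node[n])]
    # Fallback: spill across nodes, largest free blocks first,
    # to minimize the number of nodes (and network hops) involved.
    chosen, remaining = [], job_gpus
    for n in sorted(free_per_node, key=free_per_node.get, reverse=True):
        if remaining <= 0:
            break
        if free_per_node[n] > 0:
            chosen.append(n)
            remaining -= free_per_node[n]
    # Gang semantics: if the cluster cannot fit the job, allocate nothing.
    return chosen if remaining <= 0 else None

# A 4-GPU job goes to node-b whole rather than spreading over node-a + node-c.
print(place(4, {"node-a": 2, "node-b": 4, "node-c": 3}))  # ['node-b']
```

A production scheduler would score PCIe-switch and rack locality too, but the preference order is the same: NVLink domain, then node, then minimal network fan-out.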

Performance Impact: Properly topology-aware placement can improve distributed training throughput by 30-50% compared to naive scheduling, with no changes to the training code itself.

💡 Key Takeaways
NVLink (600 GB/s) vs network (25 Gbps) makes topology critical for gradient sync
Gang scheduling prevents deadlock by allocating all GPUs atomically
Proper topology placement improves training throughput 30-50% with no code changes
📌 Interview Tips
1. All-reduce: 10ms intra-node (NVLink) vs 100ms inter-node (network)
2. 8-GPU job deadlock: A holds 4 waiting for 4, B holds 4 waiting for 4