
Building Production GPU Orchestration: Discovery, Scheduling Extensions, and Reliability Operations

Production GPU orchestration requires decoupled control planes that evolve independently, with specific focus on discovery, scheduling policy, and reliability operations. The discovery and advertisement layer runs an agent on each node to enumerate devices, memory size, interconnect capabilities (NVLink vs PCIe), health status, and isolation modes. The agent must handle dynamic changes such as enabling or reconfiguring Multi-Instance GPU (MIG) partitions, for example from a whole device to seven 1g.5gb slices, and it exposes allocatable units to the scheduler through node labels and extended resources. Node Feature Discovery labels nodes with instruction sets (AVX-512), accelerators (specific GPU models), and fabric characteristics (InfiniBand presence); this metadata drives scheduling decisions.

Scheduling policy extends the base Kubernetes scheduler with filters and scoring functions. Filters enforce minimum memory per device, node labels for GPU generation, and placement constraints such as same rack or availability zone. Scoring functions prioritize topology: a typical implementation adds 100 points for placements that maximize intra-node NVLink edges, 50 points for same-rack placements that reduce cross-switch traffic, and negative scores for placements that fragment nodes (a scoring sketch follows below). Gang scheduling represents distributed training jobs as pod groups with a minimum cardinality, admitting a group only when the full set can be placed simultaneously.

Reliability operations cover health monitoring, quarantine workflows, and update orchestration. Quarantine bad GPUs via device-level or node-level taints when thresholds trip: more than 5 ECC errors per hour, thermal throttling above 85 degrees Celsius, or repeated kernel timeouts. Roll driver updates gradually across fault domains, draining one node at a time and validating with conformance tests before proceeding. Track GPU idle time by team and enforce minimum utilization targets (commonly 60 to 70 percent) as a gate for unlocking additional quota.
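To make the scoring policy concrete, here is a minimal, self-contained Python sketch. The `GpuNode` structure, the exact point values, and the fragmentation penalty are illustrative assumptions; a production version would live inside a kube-scheduler scoring plugin rather than a standalone function.

```python
from dataclasses import dataclass, field

@dataclass
class GpuNode:
    name: str
    rack: str
    free_gpu_ids: set[int]
    # NVLink topology as undirected edges between local GPU indices,
    # e.g. a 4-GPU ring: {(0, 1), (1, 2), (2, 3), (3, 0)}
    nvlink_edges: set[tuple[int, int]] = field(default_factory=set)

def score_placement(assignment: dict[str, set[int]], nodes: dict[str, GpuNode]) -> int:
    """Score a candidate placement mapping node name -> GPU ids assigned there."""
    score = 0
    for name, gpu_ids in assignment.items():
        node = nodes[name]
        # +100 for every intra-node NVLink edge with both endpoints in this placement.
        score += 100 * sum(1 for a, b in node.nvlink_edges if a in gpu_ids and b in gpu_ids)
        # Negative score for fragmentation: stranding a single free GPU on a node
        # leaves it useless for future multi-GPU jobs.
        if len(node.free_gpu_ids) - len(gpu_ids) == 1:
            score -= 25
    racks = {nodes[name].rack for name in assignment}
    if len(racks) == 1 and len(assignment) > 1:
        score += 50  # same-rack bonus for multi-node placements: less cross-switch traffic
    return score
```

Counting NVLink edges that are actually present in the node's topology (ring, mesh, or NVSwitch) keeps the score tied to delivered intra-node bandwidth rather than raw GPU count.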
💡 Key Takeaways
Discovery agents detect GPUs, memory, interconnects, and health, then advertise them via node labels and extended resources. Agents must handle dynamic MIG reconfiguration, publishing updated allocatable counts (for example, changing from 8 whole GPUs to 56 MIG instances) within 30 seconds; a discovery sketch follows these takeaways.
Scheduling extensions implement filters for capability matching (minimum 40 GB memory) and scoring for topology (100 points for NVLink, 50 for same rack). Gang scheduling admits distributed jobs only when all requested GPUs with required topology are available.
Fractional GPU scheduling uses reservation pods that claim whole devices, then multiplexes fractional requests onto those reservations. On a 2 GPU node, four 0.5 fractional slices map onto two reserved GPUs without conflicting with the base scheduler's accounting; an allocation sketch follows these takeaways.
Health controllers quarantine devices on threshold violations: more than 5 ECC errors per hour, sustained temperature above 85 degrees Celsius, or repeated kernel failures. Quarantine triggers automated tickets for hardware remediation and prevents new work from being assigned; a quarantine sketch follows the examples below.
Update orchestration rolls driver and runtime changes across fault domains with validation gates. Drain one node, update components, run conformance tests covering whole GPU, MIG slices, and time slicing modes, then proceed to the next node only on success; a rollout sketch follows these takeaways.
Cost controls enforce GPU idle time limits by team, typically requiring 60 to 70 percent minimum utilization before additional quota is granted. Separate node pools prevent non-GPU workloads from landing on expensive GPU nodes, and spot instances for training reduce costs by 60 to 80 percent.
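The discovery takeaway can be sketched as a small agent loop. This sketch assumes nvidia-ml-py (`pynvml`) and the official Kubernetes Python client; the label keys and `NODE_NAME` are placeholders, and in a real cluster the NVIDIA device plugin, not this loop, publishes extended resources such as nvidia.com/mig-1g.5gb.

```python
# Minimal discovery loop: read GPU/MIG inventory via NVML and publish it as
# node labels with the Kubernetes Python client. The device plugin owns the
# extended-resource counts; this sketch only refreshes descriptive labels
# that scheduler filters can match on.
import time
import pynvml
from kubernetes import client, config

NODE_NAME = "node-47"  # illustrative; normally injected via the Downward API

def gpu_inventory() -> dict[str, str]:
    pynvml.nvmlInit()
    try:
        whole, mig = 0, 0
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current, _pending = pynvml.nvmlDeviceGetMigMode(handle)
            except pynvml.NVMLError:
                current = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU without MIG support
            if current == pynvml.NVML_DEVICE_MIG_ENABLE:
                # Count active MIG instances on this physical GPU.
                for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
                    try:
                        pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, j)
                        mig += 1
                    except pynvml.NVMLError:
                        break
            else:
                whole += 1
        # Label keys are illustrative placeholders.
        return {"example.com/whole-gpus": str(whole), "example.com/mig-instances": str(mig)}
    finally:
        pynvml.nvmlShutdown()

def publish(labels: dict[str, str]) -> None:
    config.load_incluster_config()
    client.CoreV1Api().patch_node(NODE_NAME, {"metadata": {"labels": labels}})

if __name__ == "__main__":
    while True:            # re-advertise so MIG reconfiguration shows up quickly
        publish(gpu_inventory())
        time.sleep(30)     # matches the ~30 second advertisement target above
```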
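The reservation-pod takeaway reduces to a small packing problem. The sketch below is pure bookkeeping under assumed names (`Reservation`, `place_fraction`); a real implementation would sit behind a device plugin or a mutating webhook rather than a standalone script.

```python
# Illustrative allocator for fractional GPU scheduling: reservation pods pin
# whole devices, and this bookkeeping packs fractional requests onto them.
from dataclasses import dataclass, field

@dataclass
class Reservation:
    gpu_id: int
    capacity: float = 1.0
    used: float = 0.0
    pods: list[str] = field(default_factory=list)

def place_fraction(reservations: list[Reservation], pod: str, fraction: float) -> Reservation | None:
    """Best-fit pack a fractional request (e.g. 0.5) onto an already-reserved GPU."""
    candidates = [r for r in reservations if r.capacity - r.used >= fraction]
    if not candidates:
        return None                                               # no room: reject or queue
    best = min(candidates, key=lambda r: r.capacity - r.used)     # tightest fit first
    best.used += fraction
    best.pods.append(pod)
    return best

# Example from the text: a 2 GPU node and four 0.5 requests land on two reserved GPUs.
node = [Reservation(gpu_id=0), Reservation(gpu_id=1)]
for name in ("infer-a", "infer-b", "infer-c", "infer-d"):
    placed = place_fraction(node, name, 0.5)
    print(name, "->", f"gpu {placed.gpu_id}" if placed else "unschedulable")
```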
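The update-orchestration takeaway maps to a simple loop with a validation gate. The sketch below shells out to `kubectl cordon`, `kubectl drain`, and `kubectl uncordon`; `update_gpu_stack` and `run_conformance` are placeholders for the driver upgrade mechanism and the conformance suite, not real commands.

```python
# Rollout loop: one node at a time per fault domain, with a conformance gate
# before proceeding. A failure leaves the node cordoned and halts the rollout.
import subprocess
import sys

def sh(*args: str) -> None:
    subprocess.run(args, check=True)

def update_gpu_stack(node: str) -> None:
    raise NotImplementedError("driver/runtime upgrade, e.g. via a privileged DaemonSet")

def run_conformance(node: str) -> bool:
    raise NotImplementedError("launch test pods covering whole GPU, MIG slices, time slicing")

def roll(fault_domains: dict[str, list[str]]) -> None:
    for domain, nodes in fault_domains.items():
        for node in nodes:                       # strictly one node at a time
            sh("kubectl", "cordon", node)
            sh("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
            update_gpu_stack(node)
            if not run_conformance(node):
                print(f"conformance failed on {node}; halting rollout in {domain}", file=sys.stderr)
                return                           # leave the node cordoned for investigation
            sh("kubectl", "uncordon", node)
```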
📌 Examples
An AI platform runs a discovery agent that detects when an operator reconfigures an A100 from whole device to 1g.5gb MIG profile. Within 30 seconds, the node advertises nvidia.com/mig-1g.5gb: 7 instead of nvidia.com/gpu: 1, and the scheduler begins placing lightweight inference workloads.
A custom scheduler extension scores placements for a 4 GPU job. A single node with NVLink scores 400 points (100 per NVLink-connected GPU pair); two nodes in the same rack score 250 points; two nodes in different racks score 150. The scheduler picks the single node, delivering 900 GB/s of NVLink bandwidth instead of a 200 Gbps cross-node network.
A health controller detects GPU 3 on node-47 with 8 ECC errors in 45 minutes and a sustained temperature of 88 degrees Celsius. It applies a taint, evicts the 4 running pods, which reschedule elsewhere within 60 seconds, and creates a hardware ticket. Before this system, flaky GPUs caused 15 percent of jobs to fail with no clear cause.
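The node-47 example above corresponds to a small quarantine decision plus a taint. The sketch below assumes the Kubernetes Python client; the taint key and the kernel-timeout threshold of 3 are illustrative choices, and metric collection is stubbed rather than wired to NVML or DCGM exporters.

```python
# Quarantine decision: evaluate the thresholds named above and taint the node
# with NoSchedule so the scheduler stops placing new GPU work there. Pod
# eviction and ticket creation would follow as separate steps.
from dataclasses import dataclass
from kubernetes import client, config

ECC_ERRORS_PER_HOUR = 5
MAX_TEMP_C = 85

@dataclass
class GpuHealth:
    node: str
    gpu_index: int
    ecc_errors_last_hour: int
    temperature_c: int
    kernel_timeouts: int

def should_quarantine(h: GpuHealth) -> bool:
    return (
        h.ecc_errors_last_hour > ECC_ERRORS_PER_HOUR
        or h.temperature_c > MAX_TEMP_C
        or h.kernel_timeouts >= 3          # "repeated" interpreted as 3+ in this sketch
    )

def quarantine(node_name: str) -> None:
    """Add a NoSchedule taint so only tolerating (repair) pods land on the node."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    taint = client.V1Taint(key="example.com/gpu-unhealthy", value="true", effect="NoSchedule")
    node = v1.read_node(node_name)
    v1.patch_node(node_name, {"spec": {"taints": (node.spec.taints or []) + [taint]}})

# The node-47 numbers from the example above trip both the ECC and thermal thresholds.
sample = GpuHealth(node="node-47", gpu_index=3, ecc_errors_last_hour=8,
                   temperature_c=88, kernel_timeouts=0)
if should_quarantine(sample):
    quarantine(sample.node)
```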