Building Production GPU Orchestration: Discovery, Scheduling Extensions, and Reliability Operations
Production GPU Orchestration Stack: A complete GPU orchestration system includes device discovery, extended scheduling, quota management, and operational tooling. Each layer builds on Kubernetes primitives but requires ML-specific extensions.
Device Discovery and Registration
The device plugin framework allows GPUs to appear as schedulable resources. Plugins run on each node, detect available GPUs, and report them to the scheduler. Beyond basic count, ML clusters need: GPU model (A100 vs V100), memory capacity, topology information (NVLink connections), and health status. This metadata enables intelligent scheduling decisions. Without rich discovery, the scheduler sees only "4 GPUs available" with no ability to distinguish between generations or configurations.
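To make rich discovery concrete, here is a minimal sketch in Python of the per-device metadata described above. The names (GPUDevice, schedulable) are illustrative assumptions, not any real plugin's API; an actual implementation would report this metadata via the Kubernetes device plugin gRPC interface.

```python
from dataclasses import dataclass, field

@dataclass
class GPUDevice:
    """Metadata a device plugin can report per GPU, beyond a bare count."""
    device_id: str
    model: str                 # generation, e.g. "A100" vs "V100"
    memory_gb: int             # memory capacity
    nvlink_peers: list[str] = field(default_factory=list)  # topology info
    healthy: bool = True       # health status

def schedulable(devices: list[GPUDevice], model: str, min_memory_gb: int) -> list[GPUDevice]:
    """Filter to healthy GPUs matching the requested generation and memory."""
    return [d for d in devices
            if d.healthy and d.model == model and d.memory_gb >= min_memory_gb]

node = [
    GPUDevice("gpu-0", "A100", 80, nvlink_peers=["gpu-1"]),
    GPUDevice("gpu-1", "A100", 80, nvlink_peers=["gpu-0"]),
    GPUDevice("gpu-2", "V100", 32),
    GPUDevice("gpu-3", "A100", 80, healthy=False),  # excluded: unhealthy
]
print([d.device_id for d in schedulable(node, "A100", 40)])  # ['gpu-0', 'gpu-1']
```

With only a count ("4 GPUs available"), the V100 and the unhealthy A100 would be indistinguishable from the two usable devices.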
Scheduling Extensions
Default schedulers are insufficient for ML workloads. Common extensions include:
- Gang scheduler: allocates all GPUs for a job atomically, preventing deadlock from partial allocations.
- Topology scheduler: prefers GPUs with fast interconnects (e.g., NVLink) for multi-GPU jobs.
- Preemption controller: allows high-priority jobs to evict lower-priority workloads.
- Quota manager: enforces per-team GPU limits to prevent resource hogging.
These components integrate with the core scheduler via extension points (scheduling framework, webhooks).
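The gang-scheduling idea above (all of a job's GPUs or none) can be sketched as follows. This is a simplified in-memory model under assumed names (GangScheduler, try_schedule), not the interface of any real scheduler plugin:

```python
class GangScheduler:
    """All-or-nothing allocation: a job gets every GPU it asked for, or none."""

    def __init__(self, free_gpus: int):
        self.free = free_gpus
        self.allocations: dict[str, int] = {}

    def try_schedule(self, job_id: str, gpus_needed: int) -> bool:
        # Atomic check-and-allocate: never hand out a partial allocation,
        # which is what deadlocks two jobs that each hold half of what
        # the other needs.
        if gpus_needed > self.free:
            return False
        self.free -= gpus_needed
        self.allocations[job_id] = gpus_needed
        return True

    def release(self, job_id: str) -> None:
        self.free += self.allocations.pop(job_id, 0)

sched = GangScheduler(free_gpus=8)
print(sched.try_schedule("train-a", 8))  # True: the whole gang fits
print(sched.try_schedule("train-b", 4))  # False: refused outright, nothing held
sched.release("train-a")
print(sched.try_schedule("train-b", 4))  # True: now it fits
```

The key property is that "train-b" holds zero GPUs while it waits; a naive per-GPU scheduler would let it grab some of the eight and stall both jobs.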
Operational Tooling
Running GPU clusters requires specialized observability.
- Metrics to collect: per-GPU utilization, memory usage, temperature, power draw, ECC errors, and per-job GPU time.
- Dashboards should show: cluster-wide utilization (target 80%+), queue depth by priority, fragmentation ratio, and unhealthy GPU count.
- Automation: auto-drain nodes with failing GPUs, auto-scale node pools based on queue depth, auto-terminate jobs exceeding time limits.
Without this tooling, operators cannot manage GPU resources effectively at scale.
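As an illustration of the dashboard math, the fragmentation ratio can be computed as the fraction of free GPUs stranded on nodes too fragmented to fit a queued job. This is one plausible definition, assumed here for the sketch; the metric is not standardized:

```python
def fragmentation_ratio(free_per_node: list[int], job_size: int) -> float:
    """Fraction of free GPUs on nodes that cannot fit a job of
    `job_size` GPUs (one possible definition, assumed for illustration)."""
    total_free = sum(free_per_node)
    if total_free == 0:
        return 0.0
    stranded = sum(n for n in free_per_node if n < job_size)
    return stranded / total_free

# Six GPUs free, but a 4-GPU job fits on no single node: fully fragmented.
print(fragmentation_ratio([2, 2, 2], job_size=4))  # 1.0
# Same six GPUs, one node with four free: only 2 of 6 are stranded.
print(fragmentation_ratio([4, 2], job_size=4))     # 0.333...
```

A rising ratio with a nonempty queue is the signal to defragment (via preemption or job packing) rather than to add nodes.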
Integration Path: Start with a device plugin for discovery, add gang scheduling for training workloads, implement quota management for multi-tenancy, then add topology awareness as the cluster grows.