Training Infrastructure & Pipelines • GPU Allocation & Job Scheduling (Hard, ~3 min)
Implementation Patterns: Two-Level Scheduling and Profiling-Based Co-location
Production Graphics Processing Unit (GPU) schedulers implement a two-level architecture: global admission control with topology-aware placement, and local device management that enforces spatial or temporal sharing. The global layer runs gang scheduling, decides which jobs to admit based on priority tiers (latency-sensitive vs. best-effort), and solves a topology-constrained bin-packing problem. It models the GPU-to-PCIe-switch-to-CPU-socket-to-node hierarchy (as in HiveD) and prefers packing within NVLink islands before spanning nodes. For all-reduce-heavy training, it keeps the entire gang within high-bandwidth domains; for parameter-server patterns, it places shards near Network Interface Card (NIC)-rich nodes with Non-Uniform Memory Access (NUMA) affinity.
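A minimal sketch of the global layer's topology-constrained gang placement, assuming a simplified cluster model in which each node exposes fixed-size NVLink islands; `Island`, `place_gang`, and the greedy fallback are illustrative names, not any particular scheduler's API.

```python
from dataclasses import dataclass

@dataclass
class Island:
    """One NVLink-connected group of GPUs on a node (e.g. 8 GPUs)."""
    node: str
    free_gpus: int

def place_gang(islands, gang_size):
    """Greedy topology-aware placement: prefer one island, then as few islands as possible."""
    # 1. Try to fit the whole gang inside a single NVLink island.
    for isl in islands:
        if isl.free_gpus >= gang_size:
            isl.free_gpus -= gang_size
            return [(isl.node, gang_size)]
    # 2. Otherwise tentatively pack islands, largest-free first, so the gang
    #    spans as few islands as possible; commit only if it fits completely.
    tentative, remaining = [], gang_size
    for isl in sorted(islands, key=lambda i: i.free_gpus, reverse=True):
        if remaining == 0:
            break
        take = min(isl.free_gpus, remaining)
        if take:
            tentative.append((isl, take))
            remaining -= take
    if remaining > 0:
        return None  # gang semantics: all-or-nothing, the job stays queued
    for isl, take in tentative:
        isl.free_gpus -= take
    return [(isl.node, take) for isl, take in tentative]

# Example: two 8-GPU islands on node0, one on node1; a 16-GPU job lands on node0's two islands.
cluster = [Island("node0", 8), Island("node0", 8), Island("node1", 8)]
print(place_gang(cluster, 16))
```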
The local layer enforces sharing policies: whole-GPU or Multi-Instance GPU (MIG) slice allocation for spatial isolation, or time slicing aligned to mini-batch boundaries for temporal sharing. It manages Compute Unified Device Architecture (CUDA) context lifecycle to amortize creation cost (hundreds of milliseconds) and applies Multi-Process Service (MPS) when many small kernels benefit from concurrent execution. For Ahead-of-Time (AoT) execution, it records a warm-up iteration once to build the operator Directed Acyclic Graph (DAG) and stream assignments, then replays that schedule every iteration to eliminate launch overhead. It overlaps transfers and compute using double or triple buffering: while stream 0 computes batch N, stream 1 stages batch N+1 and stream 2 copies results for batch N-1.
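A minimal sketch of the record-once, replay-every-iteration pattern, here using PyTorch CUDA graphs plus a prefetch stream as one concrete stand-in for the AoT replay and double buffering described above; the model, batch shapes, and `load_next_batch` helper are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # placeholder model
static_in = torch.zeros(64, 1024, device="cuda")   # static input buffer the graph reads from

# Warm up on a side stream, then record one iteration into a CUDA graph:
# the operator DAG and stream assignments are captured once and replayed cheaply.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

copy_stream = torch.cuda.Stream()                  # H2D prefetch stream

def load_next_batch():
    # Placeholder: pinned host memory enables async H2D copies.
    return torch.randn(64, 1024, pin_memory=True)

next_host = load_next_batch()
for step in range(10):
    # Stage batch N+1 on the copy stream while the graph computes batch N.
    with torch.cuda.stream(copy_stream):
        staged = next_host.to("cuda", non_blocking=True)
    graph.replay()                                 # compute on the current static_in
    torch.cuda.current_stream().wait_stream(copy_stream)
    static_in.copy_(staged)                        # hand batch N+1 to the next replay
    next_host = load_next_batch()
    # Production code would also guard staged's memory across streams (record_stream)
    # and add a D2H result stream for true triple buffering.
```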
Co-location and interference management require profiling. Systems build an interference matrix by running representative job pairs or triples under different allocation shares and measuring compute utilization, memory-bandwidth saturation, and latency. Profiling reveals which workloads are complementary: pairing a memory-bound data-preprocessing task with a compute-bound training iteration can increase aggregate throughput by 30 to 50% over sequential execution, while co-locating two memory-bound jobs causes mutual slowdown. A greedy or Integer Linear Programming (ILP) based bin packer then uses the matrix to maximize throughput subject to Quality of Service (QoS) constraints.
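A minimal sketch of a greedy packer driven by a measured interference matrix; the pairwise throughput numbers and QoS threshold below are made-up illustrations, and a production system might instead solve this as an ILP.

```python
from itertools import combinations

# interference[(a, b)] = measured aggregate throughput of co-locating a and b,
# normalized so 1.0 means "same as running them sequentially" (from profiling runs).
interference = {
    ("img_decode", "resnet_train"): 1.4,  # memory-bound + compute-bound: complementary
    ("img_decode", "etl_shuffle"): 0.7,   # two memory-bound jobs: mutual slowdown
    ("resnet_train", "etl_shuffle"): 1.2,
}

QOS_MIN_SPEEDUP = 1.1  # only co-locate if aggregate throughput beats sequential by 10%

def pair_jobs(jobs):
    """Greedily pick the highest-throughput admissible pairs; leftovers run alone."""
    remaining, placement = set(jobs), []
    candidates = sorted(
        ((interference.get((a, b)) or interference.get((b, a), 0.0), a, b)
         for a, b in combinations(jobs, 2)),
        reverse=True,
    )
    for score, a, b in candidates:
        if score >= QOS_MIN_SPEEDUP and a in remaining and b in remaining:
            placement.append((a, b))
            remaining -= {a, b}
    placement.extend((job,) for job in remaining)  # solo GPUs for the rest
    return placement

print(pair_jobs(["img_decode", "resnet_train", "etl_shuffle"]))
# e.g. [('img_decode', 'resnet_train'), ('etl_shuffle',)]
```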
Elastic scaling and workflow orchestration complete the system. Elastic jobs grow or shrink their worker count at safe synchronization points (end of epoch or explicit barriers) to defragment capacity without destabilizing training. Directed Acyclic Graph (DAG) schedulers predict task durations from historical runs to pre-warm capacity and reduce idle gaps between dependent tasks. Critical-path analysis prioritizes tasks that unblock the most downstream work, minimizing end-to-end pipeline latency.
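A minimal sketch of critical-path prioritization over a task DAG, assuming predicted durations come from historical runs; the tiny example graph and its numbers are hypothetical.

```python
from functools import lru_cache

# Predicted task durations (seconds) from historical runs, and DAG edges:
# task -> downstream tasks that cannot start until it finishes.
duration = {"ingest": 30, "preprocess": 120, "train": 600, "eval": 60, "export": 15}
children = {"ingest": ["preprocess"], "preprocess": ["train"],
            "train": ["eval", "export"], "eval": [], "export": []}

@lru_cache(maxsize=None)
def critical_path(task):
    """Length of the longest remaining path starting at `task` (used as its priority)."""
    return duration[task] + max((critical_path(c) for c in children[task]), default=0.0)

# Schedule ready tasks longest-critical-path first so they unblock downstream work sooner.
ready = ["ingest"]
print(sorted(ready, key=critical_path, reverse=True))
print({t: critical_path(t) for t in duration})
# {'ingest': 810.0, 'preprocess': 780.0, 'train': 660.0, 'eval': 60.0, 'export': 15.0}
```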
💡 Key Takeaways
• Two-level scheduling: the global layer handles gang admission, priority tiers, and topology-aware placement modeling the GPU-to-PCIe-to-socket hierarchy; the local layer enforces spatial (MIG) or temporal (time-slicing) sharing
• AoT execution records one warm-up iteration to build the operator DAG and stream assignments, then replays that schedule every iteration, eliminating runtime launch overhead; Host-to-Device (H2D) copies, compute, and Device-to-Host (D2H) copies are pipelined with double or triple buffering
• Profiling-based co-location builds an interference matrix by running job pairs under different shares and measuring utilization and latency; pairing complementary workloads (memory-bound plus compute-bound) increases throughput by 30 to 50%
• Elastic scaling grows or shrinks worker count at epoch boundaries or explicit barriers to defragment capacity; DAG schedulers predict task durations and prioritize the critical path to minimize pipeline latency
• The local layer manages CUDA context lifecycle (hundreds of milliseconds creation cost), applies MPS for concurrent small kernels, and enforces QoS isolation via MIG or priority streams
📌 Examples
HiveD global scheduler: Models 8-GPU NVLink islands as a single allocation unit, places a 16-GPU training job across 2 adjacent islands to minimize cross-node traffic, and queues the job until the topology constraint is satisfied
Profiling-driven co-location: Pairing image decoding (memory-bound, 40% GPU utilization) with ResNet training (compute-bound, 95% SM utilization) on the same GPU via MPS, achieving 1.4x aggregate throughput vs. sequential execution
Elastic training pattern: A 32-GPU job shrinks to 16 workers at epoch end when a higher-priority job arrives, releases 16 GPUs immediately, grows back to 32 when capacity frees up, and maintains training stability throughout