Priority Preemption and Multi-Tenant QoS Policies
Production Graphics Processing Unit (GPU) clusters serve heterogeneous workloads with conflicting requirements: latency-sensitive (LE) inference serving needs strict p95 and p99 Service Level Objectives (SLOs), while best-effort (BE) batch training prioritizes throughput and can tolerate interruptions. Priority preemption protects high-priority workloads by evicting lower-priority jobs when capacity is needed, but it introduces operational complexity and wasted work.
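This separation of workload classes can be captured in a small QoS spec. The sketch below is a minimal illustration in Python; the class and field names (WorkloadSpec, p99_slo_ms, and so on) are assumptions for this example, not any particular scheduler's API.

```python
from dataclasses import dataclass
from enum import IntEnum

class QosTier(IntEnum):
    """Higher value = higher scheduling priority."""
    BEST_EFFORT = 0        # BE batch training: throughput-oriented, preemptible
    LATENCY_SENSITIVE = 1  # LE inference serving: strict tail-latency SLOs

@dataclass
class WorkloadSpec:
    name: str
    tier: QosTier
    gpus_requested: int
    p99_slo_ms: float | None = None  # only meaningful for LE serving
    preemptible: bool = True

# Example: an LE serving deployment and a BE training job competing for GPUs.
serving = WorkloadSpec("llm-inference", QosTier.LATENCY_SENSITIVE,
                       gpus_requested=16, p99_slo_ms=200.0, preemptible=False)
training = WorkloadSpec("pretrain-run", QosTier.BEST_EFFORT, gpus_requested=64)
```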
Effective policies either separate capacity pools or implement strict preemption tiers. OpenAI and similar organizations maintain dedicated serving pools with reserved capacity and allow training jobs to opportunistically backfill unused serving capacity. When serving load spikes, BE training is preempted immediately. The key is minimizing wasted progress: preempt only at mini-batch or iteration boundaries, after checkpoints are saved, typically every N steps or M seconds. Checkpointing a large language model can take 30 to 120 seconds; more frequent saves reduce lost work but increase storage Input/Output (I/O) overhead and slow training.
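One common way to realize boundary-aligned preemption is cooperative handling of an eviction signal inside the training loop. The sketch below is illustrative only: it assumes the scheduler delivers SIGTERM and that run_step and save_checkpoint are caller-supplied helpers, and it does not reflect any specific framework's API.

```python
import signal

class PreemptionHandler:
    """Cooperative preemption: the scheduler sends SIGTERM, but the training
    loop acts on it only at a mini-batch boundary, after saving a checkpoint."""
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.requested = True  # defer handling to the next safe point

def train(run_step, batches, save_checkpoint, checkpoint_every_steps=500):
    """run_step, batches, and save_checkpoint are supplied by the caller."""
    handler = PreemptionHandler()
    last_ckpt = 0
    for step, batch in enumerate(batches):
        run_step(batch)  # one mini-batch of training

        # Periodic checkpoints bound lost work to <= checkpoint_every_steps steps,
        # at the cost of storage I/O (30-120 s per save for a large model).
        if step - last_ckpt >= checkpoint_every_steps:
            save_checkpoint(step)
            last_ckpt = step

        # Safe preemption point: exit only at an iteration boundary,
        # and only after persisting progress, then release the GPUs.
        if handler.requested:
            save_checkpoint(step)
            return
```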
Without safeguards, preemption causes thrashing: a job is killed, queued, restarted, then killed again before making meaningful progress. Systems like AntMan implement promotion and aging: BE jobs that have been preempted multiple times or have waited beyond a threshold are promoted to a higher priority to ensure eventual completion. Gang-scheduled distributed jobs are particularly vulnerable; preempting one worker in an 8-GPU job wastes the other 7 allocated GPUs until the entire gang is reassembled.
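A promotion-and-aging rule of this kind can be approximated with a small priority function. The sketch below is not AntMan's actual implementation; JobState and the 3-eviction / 1-hour thresholds are assumptions taken from the example values later in this section.

```python
from dataclasses import dataclass

@dataclass
class JobState:
    base_priority: int       # 0 = best-effort (BE), 1 = latency-sensitive (LE)
    submitted_at: float      # epoch seconds
    preemption_count: int = 0
    promoted: bool = False

def effective_priority(job: JobState, now: float,
                       max_preemptions: int = 3,
                       max_wait_s: float = 3600.0) -> int:
    """Aging/promotion in the spirit of AntMan: a BE job evicted more than
    max_preemptions times, or waiting longer than max_wait_s, is promoted so
    that it eventually completes despite preemption pressure."""
    if job.preemption_count > max_preemptions or (now - job.submitted_at) > max_wait_s:
        job.promoted = True
    # A promoted BE job schedules at the higher tier and is no longer preempted.
    return max(job.base_priority, 1) if job.promoted else job.base_priority
```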
The tradeoff is serving SLO protection versus training throughput. Strict preemption guarantees that spikes in serving traffic never degrade latency, but aggressive policies can discard hours of training progress if checkpoints are infrequent. Systems like Gandiva achieve job migration in under 4 seconds, so jobs can be moved to free capacity rather than killed, but migration still disrupts progress and is expensive at scale. In practice, operators tune preemption frequency caps (no more than X evictions per job per hour) and invest in fast checkpointing infrastructure to balance protection and efficiency.
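A preemption frequency cap is essentially a per-job rate limiter on evictions. The sketch below is a hypothetical illustration (the EvictionRateLimiter class and the default of 3 evictions per hour are assumptions mirroring the example values in this section) of how a scheduler might consult such a cap before evicting a job.

```python
import time
from collections import defaultdict, deque

class EvictionRateLimiter:
    """Caps how often any single job may be preempted (e.g., at most
    max_evictions per window_s), so that infrequently checkpointed jobs
    are not repeatedly evicted before making progress."""
    def __init__(self, max_evictions: int = 3, window_s: float = 3600.0):
        self.max_evictions = max_evictions
        self.window_s = window_s
        self._history = defaultdict(deque)  # job_id -> timestamps of past evictions

    def may_preempt(self, job_id: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        recent = self._history[job_id]
        while recent and now - recent[0] > self.window_s:
            recent.popleft()  # drop evictions that fell outside the window
        return len(recent) < self.max_evictions

    def record_preemption(self, job_id: str, now: float | None = None) -> None:
        self._history[job_id].append(time.time() if now is None else now)
```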
💡 Key Takeaways
•Priority preemption separates latency-sensitive (LE) inference serving with strict SLOs from best-effort (BE) training; LE reclaims capacity immediately when serving load spikes to protect p95 and p99 targets
•Preempt at mini-batch or iteration boundaries after checkpoints to minimize wasted work; large language model checkpoints take 30 to 120 seconds, so frequent saves reduce lost work but increase storage I/O overhead
•Thrashing risk: jobs preempted repeatedly before making progress waste queue time and compute; promotion and aging policies (as in AntMan) escalate BE priority after X preemptions or Y wait time to ensure completion
•Gang-scheduled distributed jobs amplify waste: preempting 1 worker in an 8-GPU job idles the other 7 GPUs until the entire gang is reassembled, multiplying the cost of eviction
•Operators tune preemption frequency caps (e.g., max 3 evictions per job per hour) and invest in fast checkpointing or sub-4-second migration (Gandiva) to balance SLO protection with training efficiency
📌 Examples
OpenAI production policy: Separate capacity pools for serving and training, strict preemption of BE training jobs when serving traffic spikes, checkpoint every N steps to bound lost work on eviction
AntMan scheduler: BE jobs preempted more than 3 times or waiting over 1 hour are promoted to a higher priority tier, preventing starvation and ensuring eventual completion despite preemption pressure
Meta research cluster pattern: Backfill short BE jobs in gaps between LE workloads, preempt at mini-batch boundaries to reduce disruption, and use fast checkpointing to minimize restart overhead