Priority Preemption and Multi-Tenant QoS Policies
Production Graphics Processing Unit (GPU) clusters serve heterogeneous workloads with conflicting requirements: latency-sensitive (LE) inference serving needs strict p95 and p99 Service Level Objectives (SLOs), while best-effort (BE) batch training prioritizes throughput and can tolerate interruptions. Priority preemption protects high-priority workloads by evicting lower-priority jobs when capacity is needed, but it introduces operational complexity and wasted work.
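This separation of workload classes can be captured in a small QoS spec. The sketch below is a minimal illustration in Python; the class and field names (WorkloadSpec, p99_slo_ms, and so on) are assumptions for this example, not any particular scheduler's API.

```python
from dataclasses import dataclass
from enum import IntEnum

class QosTier(IntEnum):
    """Higher value = higher scheduling priority."""
    BEST_EFFORT = 0        # BE batch training: throughput-oriented, preemptible
    LATENCY_SENSITIVE = 1  # LE inference serving: strict tail-latency SLOs

@dataclass
class WorkloadSpec:
    name: str
    tier: QosTier
    gpus_requested: int
    p99_slo_ms: float | None = None  # only meaningful for LE serving
    preemptible: bool = True

# Example: an LE serving deployment and a BE training job competing for GPUs.
serving = WorkloadSpec("llm-inference", QosTier.LATENCY_SENSITIVE,
                       gpus_requested=16, p99_slo_ms=200.0, preemptible=False)
training = WorkloadSpec("pretrain-run", QosTier.BEST_EFFORT, gpus_requested=64)
```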
Effective policies either separate capacity pools or implement strict preemption tiers. OpenAI and similar organizations maintain dedicated serving pools with reserved capacity and allow training jobs to opportunistically backfill unused serving capacity. When serving load spikes, BE training is preempted immediately. The key is minimizing wasted progress: preempt only at mini-batch or iteration boundaries, after checkpoints are saved, typically every N steps or M seconds. Checkpointing a large language model can take 30 to 120 seconds; more frequent saves reduce lost work but increase storage Input/Output (I/O) overhead and slow training.
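One common way to realize boundary-aligned preemption is cooperative handling of an eviction signal inside the training loop. The sketch below is illustrative only: it assumes the scheduler delivers SIGTERM and that run_step and save_checkpoint are caller-supplied helpers, and it does not reflect any specific framework's API.

```python
import signal

class PreemptionHandler:
    """Cooperative preemption: the scheduler sends SIGTERM, but the training
    loop acts on it only at a mini-batch boundary, after saving a checkpoint."""
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.requested = True  # defer handling to the next safe point

def train(run_step, batches, save_checkpoint, checkpoint_every_steps=500):
    """run_step, batches, and save_checkpoint are supplied by the caller."""
    handler = PreemptionHandler()
    last_ckpt = 0
    for step, batch in enumerate(batches):
        run_step(batch)  # one mini-batch of training

        # Periodic checkpoints bound lost work to <= checkpoint_every_steps steps,
        # at the cost of storage I/O (30-120 s per save for a large model).
        if step - last_ckpt >= checkpoint_every_steps:
            save_checkpoint(step)
            last_ckpt = step

        # Safe preemption point: exit only at an iteration boundary,
        # and only after persisting progress, then release the GPUs.
        if handler.requested:
            save_checkpoint(step)
            return
```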
Without safeguards, preemption causes thrashing: a job is killed, queued, restarted, then killed again before making meaningful progress. Systems like AntMan implement promotion and aging: BE jobs that have been preempted multiple times or have waited beyond a threshold are promoted to a higher priority to ensure eventual completion. Gang-scheduled distributed jobs are particularly vulnerable; preempting one worker in an 8-GPU job wastes the other 7 allocated GPUs until the entire gang is reassembled.
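A promotion-and-aging rule of this kind can be approximated with a small priority function. The sketch below is not AntMan's actual implementation; JobState and the 3-eviction / 1-hour thresholds are assumptions taken from the example values later in this section.

```python
from dataclasses import dataclass

@dataclass
class JobState:
    base_priority: int       # 0 = best-effort (BE), 1 = latency-sensitive (LE)
    submitted_at: float      # epoch seconds
    preemption_count: int = 0
    promoted: bool = False

def effective_priority(job: JobState, now: float,
                       max_preemptions: int = 3,
                       max_wait_s: float = 3600.0) -> int:
    """Aging/promotion in the spirit of AntMan: a BE job evicted more than
    max_preemptions times, or waiting longer than max_wait_s, is promoted so
    that it eventually completes despite preemption pressure."""
    if job.preemption_count > max_preemptions or (now - job.submitted_at) > max_wait_s:
        job.promoted = True
    # A promoted BE job schedules at the higher tier and is no longer preempted.
    return max(job.base_priority, 1) if job.promoted else job.base_priority
```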
The tradeoff is serving SLO protection versus training throughput. Strict preemption guarantees that spikes in serving traffic never degrade latency, but aggressive policies can discard hours of training progress if checkpoints are infrequent. Systems like Gandiva achieve job migration in under 4 seconds, so jobs can be moved to free capacity rather than killed, but migration still disrupts progress and is expensive at scale. In practice, operators tune preemption frequency caps (no more than X evictions per job per hour) and invest in fast checkpointing infrastructure to balance protection and efficiency.
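A preemption frequency cap is essentially a per-job rate limiter on evictions. The sketch below is a hypothetical illustration (the EvictionRateLimiter class and the default of 3 evictions per hour are assumptions mirroring the example values in this section) of how a scheduler might consult such a cap before evicting a job.

```python
import time
from collections import defaultdict, deque

class EvictionRateLimiter:
    """Caps how often any single job may be preempted (e.g., at most
    max_evictions per window_s), so that infrequently checkpointed jobs
    are not repeatedly evicted before making progress."""
    def __init__(self, max_evictions: int = 3, window_s: float = 3600.0):
        self.max_evictions = max_evictions
        self.window_s = window_s
        self._history = defaultdict(deque)  # job_id -> timestamps of past evictions

    def may_preempt(self, job_id: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        recent = self._history[job_id]
        while recent and now - recent[0] > self.window_s:
            recent.popleft()  # drop evictions that fell outside the window
        return len(recent) < self.max_evictions

    def record_preemption(self, job_id: str, now: float | None = None) -> None:
        self._history[job_id].append(time.time() if now is None else now)
```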
💡 Key Takeaways
•Priority preemption separates latency-sensitive (LE) inference serving with strict SLOs from best-effort (BE) training; LE reclaims capacity immediately when serving load spikes to protect p95 and p99 targets
•Preempt at mini-batch or iteration boundaries after checkpoints to minimize wasted work; large language model checkpoints take 30 to 120 seconds, so frequent saves reduce lost work but increase storage I/O overhead
•Thrashing risk: jobs preempted repeatedly before making progress waste queue time and compute; promotion and aging policies (as in AntMan) escalate BE priority after X preemptions or Y wait time to ensure completion
•Gang-scheduled distributed jobs amplify waste: preempting 1 worker in an 8-GPU job idles the other 7 GPUs until the entire gang is reassembled, multiplying the cost of eviction
•Operators tune preemption frequency caps (e.g., max 3 evictions per job per hour) and invest in fast checkpointing or sub-4-second migration (Gandiva) to balance SLO protection with training efficiency
📌 Examples
OpenAI production policy: Separate capacity pools for serving and training, strict preemption of BE training jobs when serving traffic spikes, checkpoint every N steps to bound lost work on eviction
AntMan scheduler: BE jobs preempted more than 3 times or waiting over 1 hour are promoted to a higher priority tier, preventing starvation and ensuring eventual completion despite preemption pressure
Meta research cluster pattern: Backfill short BE jobs in gaps between LE workloads, preempt at mini-batch boundaries to reduce disruption, and use fast checkpointing to minimize restart overhead