Priority Preemption and Multi-Tenant QoS Policies
Heterogeneous Workload Requirements
Production GPU clusters serve heterogeneous workloads with conflicting requirements: latency-sensitive (LS) inference serving needs strict p95 and p99 latency SLOs, while best-effort (BE) batch training prioritizes throughput and can tolerate interruptions. Priority preemption protects high-priority workloads by evicting lower-priority jobs when capacity is needed, but it introduces operational complexity and wasted work.
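As a minimal sketch of the eviction decision (the QoS classes, job fields, and `find_victims` helper are illustrative, not from any particular scheduler), a scheduler might select only lower-tier victims until the incoming high-priority job fits:

```python
from dataclasses import dataclass
from enum import IntEnum

class QoSClass(IntEnum):
    BEST_EFFORT = 0        # batch training: throughput oriented, interruptible
    LATENCY_SENSITIVE = 1  # inference serving: strict p95/p99 SLOs

@dataclass
class Job:
    name: str
    qos: QoSClass
    gpus: int

def find_victims(running, incoming, free_gpus):
    """Pick lowest-tier running jobs to evict so `incoming` fits.

    Returns the victim list, or None if the job cannot be scheduled
    even after evicting everything below its tier.
    """
    needed = incoming.gpus - free_gpus
    victims = []
    for job in sorted(running, key=lambda j: j.qos):  # lowest tier first
        if needed <= 0:
            break
        if job.qos < incoming.qos:  # never preempt equal or higher tiers
            victims.append(job)
            needed -= job.gpus
    return victims if needed <= 0 else None
```

Jobs at the same tier are never evicted for each other, which keeps serving-vs-serving contention out of the preemption path.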
Capacity Pool Separation
Effective policies separate capacity pools or implement strict preemption tiers. Organizations maintain dedicated serving pools with reserved capacity and allow training jobs to backfill unused serving capacity opportunistically. When serving load spikes, BE training is preempted immediately. The key is minimizing wasted progress: preempt only at mini-batch or iteration boundaries after checkpoints have been saved, typically every N steps or M seconds. Checkpointing a large language model can take 30 to 120 seconds.
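The checkpoint-boundary rule can be sketched as a training loop that honors a preemption signal only immediately after saving state (the callback names and the 100-step checkpoint interval are illustrative assumptions, not a real framework API):

```python
import threading

CHECKPOINT_EVERY = 100  # steps between checkpoints (illustrative value)

def training_loop(run_step, save_checkpoint, preempt_flag: threading.Event,
                  max_steps=1000):
    """Run mini-batch steps; yield to preemption only at checkpoint boundaries.

    `run_step` and `save_checkpoint` are hypothetical callbacks supplied by
    the training framework. Returns the step at which the loop exited.
    """
    for step in range(1, max_steps + 1):
        run_step(step)
        at_boundary = step % CHECKPOINT_EVERY == 0
        if at_boundary:
            save_checkpoint(step)  # may take 30-120 s for a large LLM
        if preempt_flag.is_set() and at_boundary:
            return step            # exit with no progress lost since save
    return max_steps
```

Because the flag is only checked right after a save, the scheduler's grace period for the eviction must cover one full checkpoint interval plus the save itself.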
Thrashing Prevention
Without safeguards, preemption causes thrashing: a job is killed, queued, restarted, then killed again before making meaningful progress. Systems counter this with promotion and aging: BE jobs that have been preempted multiple times, or that have waited beyond a threshold, are promoted to a higher priority to ensure eventual completion. Gang-scheduled distributed jobs are particularly vulnerable: preempting one worker of an 8-GPU job wastes the other 7 allocated GPUs until the entire gang is reassembled.
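A minimal sketch of promotion and aging, assuming illustrative thresholds of three preemptions or one hour of queue wait before a job is bumped one tier:

```python
import time
from dataclasses import dataclass, field

MAX_PREEMPTIONS = 3        # promote after this many evictions (illustrative)
MAX_WAIT_SECONDS = 3600.0  # or after waiting this long (illustrative)

@dataclass
class QueuedJob:
    name: str
    priority: int
    preempt_count: int = 0
    enqueued_at: float = field(default_factory=time.monotonic)

def apply_aging(queue, now=None):
    """Promote BE jobs that thrash or starve so they eventually complete."""
    now = time.monotonic() if now is None else now
    for job in queue:
        if (job.preempt_count >= MAX_PREEMPTIONS
                or now - job.enqueued_at >= MAX_WAIT_SECONDS):
            job.priority += 1  # one tier up; scheduler stops evicting it
    return queue
```

Running this on each scheduling pass bounds how long a BE job can be starved, at the cost of occasionally letting an aged batch job compete with serving.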
Balancing Protection and Efficiency
The tradeoff is serving SLO protection versus training throughput. Strict preemption guarantees that spikes in serving traffic never degrade latency, but aggressive policies can discard hours of training progress if checkpoints are infrequent. Some systems support migration in under 4 seconds, allowing a job to be moved to free capacity instead of killed. Operators also tune preemption frequency caps (no more than X evictions per job per hour) and invest in fast checkpointing infrastructure.
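An eviction frequency cap can be sketched as a per-job sliding-window limiter (the class name, the two-eviction limit, and the one-hour window are hypothetical choices, not from any particular scheduler):

```python
from collections import deque

class EvictionRateLimiter:
    """Cap evictions per job within a sliding time window."""

    def __init__(self, max_evictions=2, window_s=3600.0):
        self.max_evictions = max_evictions
        self.window_s = window_s
        self.history = {}  # job name -> deque of eviction timestamps

    def may_evict(self, job_name, now):
        """True if evicting `job_name` now stays under the per-window cap."""
        events = self.history.setdefault(job_name, deque())
        while events and now - events[0] > self.window_s:
            events.popleft()  # drop evictions older than the window
        return len(events) < self.max_evictions

    def record_eviction(self, job_name, now):
        self.history.setdefault(job_name, deque()).append(now)
```

When the cap is hit, the scheduler must either pick a different victim or let the serving spike queue briefly, which is exactly the protection-versus-throughput dial the operator is tuning.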