
GPU Allocation Fundamentals: Spatial vs Temporal Sharing

GPU allocation manages a scarce resource with unique constraints: VRAM capacity, memory bandwidth, interconnect topology (NVLink vs PCIe), and driver overhead. Two fundamental sharing models exist: spatial sharing assigns dedicated GPU resources (a whole GPU, Multi-Instance GPU slices, or device partitions) to jobs, while temporal sharing time-slices a single GPU across multiple jobs.

Spatial isolation provides predictable Quality of Service (QoS) and prevents interference between tenants. NVIDIA Multi-Instance GPU (MIG) on A100 and H100 can partition one physical GPU into up to 7 isolated instances, each with dedicated Streaming Multiprocessors (SMs), VRAM, cache, and bandwidth. In production, operators run 7 concurrent inference tenants per A100 with strong latency guarantees. The tradeoff is fragmentation: if a queued job needs 3 MIG slices but you have 2 free on one GPU and 2 on another, the job cannot run despite sufficient aggregate capacity.

Temporal sharing improves utilization by switching between jobs at mini-batch boundaries or fixed time quanta. This helps with fairness and with packing mismatched workloads, but introduces cache pollution, Translation Lookaside Buffer (TLB) churn, and unpredictable latency tails. Default PyTorch execution can leave GPUs idle 70 to 90% of the time due to single-stream First-In-First-Out (FIFO) kernel submission and launch overhead. Multi-stream execution and Ahead-of-Time (AoT) scheduling eliminate this waste by replaying pre-recorded execution graphs, achieving up to 22.3x speedup for inference.

The choice depends on workload class: use spatial isolation with MIG for multi-tenant inference serving with strict Service Level Objectives (SLOs) under steady load; use temporal sharing for exploratory training, hyperparameter optimization, and short tasks where fairness matters more than per-iteration latency.
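To make the graph-replay idea concrete, here is a minimal sketch using PyTorch's CUDA Graphs API (torch.cuda.CUDAGraph); the model, batch shape, and warmup count are illustrative placeholders rather than a prescribed configuration, and the pattern assumes a CUDA-capable PyTorch build.

```python
import torch

# Sketch of AoT-style capture and replay: record the kernel sequence for one
# forward pass once, then replay it without per-kernel launch overhead.
device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).to(device).eval()

static_input = torch.randn(64, 1024, device=device)

# Warm up on a side stream so lazy initialization (cuBLAS handles, autotuning)
# is not baked into the captured graph.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one forward pass into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

# Steady state: copy new data into the static buffer and replay the
# pre-recorded graph; no per-op Python-side kernel launches.
new_batch = torch.randn(64, 1024, device=device)
static_input.copy_(new_batch)
graph.replay()
result = static_output.clone()
```

On replay, the captured kernels are submitted as a single unit, which is where the launch-overhead savings behind the quoted speedups come from.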
💡 Key Takeaways
Spatial sharing with MIG provides up to 7 isolated GPU instances per A100/H100, each with dedicated SMs, VRAM, cache, and bandwidth for predictable multi-tenant inference serving
Temporal sharing improves utilization and fairness, but default PyTorch leaves GPUs idle 70 to 90% of the time; multi-stream execution with AoT scheduling achieves up to 22.3x speedup
Fragmentation is the primary cost of spatial isolation: stranded MIG slices or whole GPUs that do not match queued job requirements leave capacity unused despite aggregate availability (see the sketch after this list)
Temporal sharing introduces cache pollution and TLB churn on context switches, causing latency tails that can violate strict SLOs for latency-sensitive serving workloads
Production choice: spatial isolation for serving with steady load and strict p95/p99 targets; temporal sharing for training, hyperparameter optimization, and exploratory workloads that prioritize fairness
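The fragmentation point can be illustrated with a toy placement routine; the Gpu class and first-fit policy below are hypothetical, not any real scheduler's API, and only show why aggregate free slices do not guarantee placement.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical first-fit placement over per-GPU free MIG slices: a job's
# slices must all come from a single physical GPU, so aggregate free
# capacity is not the same as placeable capacity.

@dataclass
class Gpu:
    name: str
    free_slices: int  # free MIG instances remaining on this physical GPU

def place(job_slices: int, fleet: List[Gpu]) -> Optional[str]:
    """Return the GPU that hosts the job, or None if no single GPU fits it."""
    for gpu in fleet:
        if gpu.free_slices >= job_slices:
            gpu.free_slices -= job_slices
            return gpu.name
    return None

fleet = [Gpu("gpu-0", free_slices=2), Gpu("gpu-1", free_slices=2)]
print(place(3, fleet))  # None: 4 slices free in aggregate, but no GPU has 3
print(place(2, fleet))  # gpu-0: a job matching the stranded shape still fits
```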
📌 Examples
NVIDIA production pattern: 7 concurrent inference tenants per A100 using MIG, each with an isolated 10 GB VRAM slice and dedicated compute, achieving strong QoS without cross-tenant interference
Nomad case study: an 8×A800 80GB node ran 66 short training jobs (1 GPU each, ~2.5 min); the scheduler queued everything beyond 8 concurrent jobs and achieved 100% completion with zero oversubscription and immediate GPU release on task completion
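As a rough sketch of the queue-then-release behavior in the Nomad case study, the loop below simulates first-come-first-served dispatch of single-GPU jobs onto a fixed pool; the job count and duration mirror the example, but this is an illustrative simulation, not Nomad's scheduler.

```python
import heapq

# Illustrative FCFS simulation: 66 single-GPU jobs on an 8-GPU node.
# Jobs beyond the 8 that fit wait in a queue; a GPU is released the moment
# its job finishes, so concurrency never exceeds the GPU count.
NUM_GPUS, NUM_JOBS, JOB_MINUTES = 8, 66, 2.5

free_gpus, pending, next_job = NUM_GPUS, NUM_JOBS, 0
running = []  # min-heap of (finish_time, job_id) for in-flight jobs
clock, peak_concurrency = 0.0, 0

while pending > 0 or running:
    # Dispatch as many queued jobs as there are free GPUs.
    while pending > 0 and free_gpus > 0:
        heapq.heappush(running, (clock + JOB_MINUTES, next_job))
        next_job += 1
        pending -= 1
        free_gpus -= 1
    peak_concurrency = max(peak_concurrency, len(running))
    # Advance time to the next completion and release that GPU immediately.
    finish_time, _ = heapq.heappop(running)
    clock = finish_time
    free_gpus += 1

print(f"all {NUM_JOBS} jobs done at ~{clock:.1f} min, "
      f"peak concurrency {peak_concurrency} (never above {NUM_GPUS})")
```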