
GPU Allocation Fundamentals: Spatial vs Temporal Sharing

GPU allocation manages a scarce resource with unique constraints: VRAM capacity, memory bandwidth, interconnect topology (NVLink vs PCIe), and driver overhead. Two fundamental sharing models exist: spatial sharing assigns dedicated GPU resources (a whole GPU, Multi-Instance GPU slices, or device partitions) to jobs, while temporal sharing time-slices a single GPU across multiple jobs.

Spatial isolation provides predictable Quality of Service (QoS) and prevents interference between tenants. NVIDIA Multi-Instance GPU (MIG) on A100 and H100 can partition one physical GPU into up to 7 isolated instances, each with dedicated Streaming Multiprocessors (SMs), VRAM, cache, and bandwidth. In production, operators run 7 concurrent inference tenants per A100 with strong latency guarantees. The tradeoff is fragmentation: if a queued job needs 3 MIG slices but you have 2 free on one GPU and 2 on another, the job cannot run despite sufficient aggregate capacity.

Temporal sharing improves utilization by switching between jobs at mini-batch boundaries or fixed time quanta. This helps with fairness and with packing mismatched workloads, but introduces cache pollution, Translation Lookaside Buffer (TLB) churn, and unpredictable latency tails. Default PyTorch execution can leave GPUs idle 70 to 90% of the time due to single-stream First-In-First-Out (FIFO) kernel submission and launch overhead. Multi-stream execution and Ahead-of-Time (AoT) scheduling eliminate this waste by replaying pre-recorded execution graphs, achieving up to 22.3x speedup for inference.

The choice depends on workload class: use spatial isolation with MIG for multi-tenant inference serving with strict Service Level Objectives (SLOs) under steady load; use temporal sharing for exploratory training, hyperparameter optimization, and short tasks where fairness matters more than per-iteration latency.
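To make the graph-replay idea concrete, here is a minimal sketch using PyTorch's CUDA Graphs API (torch.cuda.CUDAGraph); the model, batch shape, and warmup count are illustrative placeholders rather than a prescribed configuration, and the pattern assumes a CUDA-capable PyTorch build.

```python
import torch

# Sketch of AoT-style capture and replay: record the kernel sequence for one
# forward pass once, then replay it without per-kernel launch overhead.
device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).to(device).eval()

static_input = torch.randn(64, 1024, device=device)

# Warm up on a side stream so lazy initialization (cuBLAS handles, autotuning)
# is not baked into the captured graph.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one forward pass into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

# Steady state: copy new data into the static buffer and replay the
# pre-recorded graph; no per-op Python-side kernel launches.
new_batch = torch.randn(64, 1024, device=device)
static_input.copy_(new_batch)
graph.replay()
result = static_output.clone()
```

On replay, the captured kernels are submitted as a single unit, which is where the launch-overhead savings behind the quoted speedups come from.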
💡 Key Takeaways
Spatial sharing with MIG provides up to 7 isolated GPU instances per A100/H100, each with dedicated SMs, VRAM, cache, and bandwidth for predictable multi-tenant inference serving
Temporal sharing improves utilization and fairness, but default PyTorch leaves GPUs idle 70 to 90% of the time; multi-stream execution with AoT scheduling achieves up to 22.3x speedup
Fragmentation is the primary cost of spatial isolation: stranded MIG slices or whole GPUs that do not match queued job requirements leave capacity unused despite aggregate availability (see the sketch after this list)
Temporal sharing introduces cache pollution and TLB churn on context switches, causing latency tails that can violate strict SLOs for latency-sensitive serving workloads
Production choice: spatial isolation for serving with steady load and strict p95/p99 targets; temporal sharing for training, hyperparameter optimization, and exploratory workloads that prioritize fairness
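The fragmentation point can be illustrated with a toy placement routine; the Gpu class and first-fit policy below are hypothetical, not any real scheduler's API, and only show why aggregate free slices do not guarantee placement.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical first-fit placement over per-GPU free MIG slices: a job's
# slices must all come from a single physical GPU, so aggregate free
# capacity is not the same as placeable capacity.

@dataclass
class Gpu:
    name: str
    free_slices: int  # free MIG instances remaining on this physical GPU

def place(job_slices: int, fleet: List[Gpu]) -> Optional[str]:
    """Return the GPU that hosts the job, or None if no single GPU fits it."""
    for gpu in fleet:
        if gpu.free_slices >= job_slices:
            gpu.free_slices -= job_slices
            return gpu.name
    return None

fleet = [Gpu("gpu-0", free_slices=2), Gpu("gpu-1", free_slices=2)]
print(place(3, fleet))  # None: 4 slices free in aggregate, but no GPU has 3
print(place(2, fleet))  # gpu-0: a job matching the stranded shape still fits
```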
📌 Examples
NVIDIA production pattern: 7 concurrent inference tenants per A100 using MIG, each with an isolated 10 GB VRAM slice and dedicated compute, achieving strong QoS without cross-tenant interference
Nomad case study: an 8×A800 80GB node ran 66 short training jobs (1 GPU each, ~2.5 min); the scheduler queued everything beyond 8 concurrent jobs and achieved 100% completion with zero oversubscription and immediate GPU release on task completion
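As a rough sketch of the queue-then-release behavior in the Nomad case study, the loop below simulates first-come-first-served dispatch of single-GPU jobs onto a fixed pool; the job count and duration mirror the example, but this is an illustrative simulation, not Nomad's scheduler.

```python
import heapq

# Illustrative FCFS simulation: 66 single-GPU jobs on an 8-GPU node.
# Jobs beyond the 8 that fit wait in a queue; a GPU is released the moment
# its job finishes, so concurrency never exceeds the GPU count.
NUM_GPUS, NUM_JOBS, JOB_MINUTES = 8, 66, 2.5

free_gpus, pending, next_job = NUM_GPUS, NUM_JOBS, 0
running = []  # min-heap of (finish_time, job_id) for in-flight jobs
clock, peak_concurrency = 0.0, 0

while pending > 0 or running:
    # Dispatch as many queued jobs as there are free GPUs.
    while pending > 0 and free_gpus > 0:
        heapq.heappush(running, (clock + JOB_MINUTES, next_job))
        next_job += 1
        pending -= 1
        free_gpus -= 1
    peak_concurrency = max(peak_concurrency, len(running))
    # Advance time to the next completion and release that GPU immediately.
    finish_time, _ = heapq.heappop(running)
    clock = finish_time
    free_gpus += 1

print(f"all {NUM_JOBS} jobs done at ~{clock:.1f} min, "
      f"peak concurrency {peak_concurrency} (never above {NUM_GPUS})")
```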