GPU Allocation Fundamentals: Spatial vs Temporal Sharing
Spatial Isolation Benefits
Spatial isolation provides predictable QoS and prevents interference between tenants. NVIDIA Multi-Instance GPU (MIG) on the A100 and H100 can partition one physical GPU into up to 7 isolated instances, each with dedicated Streaming Multiprocessors (SMs), VRAM, cache, and memory bandwidth. In production, operators run 7 concurrent inference tenants per A100 with strong latency guarantees. The trade-off is fragmentation: because a job's slices cannot span devices, a job that needs 3 MIG slices cannot run when 2 slices are free on one GPU and 2 on another, despite sufficient aggregate capacity.
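The fragmentation constraint above can be illustrated with a minimal sketch. This is a toy placement check, not an NVIDIA API; the function name and the slice counts are illustrative assumptions.

```python
# Toy sketch of MIG slice fragmentation. A MIG-backed job must fit all
# of its slices on a single physical GPU; slices cannot span devices.
# `can_place` and its inputs are hypothetical, for illustration only.

def can_place(job_slices: int, free_slices_per_gpu: list[int]) -> bool:
    """Return True if some single GPU has enough contiguous free slices."""
    return any(free >= job_slices for free in free_slices_per_gpu)

# Two GPUs with 2 free slices each: 4 slices in aggregate, yet a
# 3-slice job has nowhere to run.
free = [2, 2]
print(can_place(3, free))  # False: capacity is fragmented
print(can_place(2, free))  # True: fits on either GPU
```

A real scheduler would also have to pick which GPU to use (e.g. best-fit to limit future fragmentation), but the feasibility test is the same per-device check.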
Temporal Sharing Trade-offs
Temporal sharing improves utilization by switching between jobs at mini-batch boundaries or fixed time quanta. This helps with fairness and with packing mismatched workloads, but it introduces cache pollution, TLB churn, and unpredictable latency tails. Default PyTorch execution can leave a GPU idle 70 to 90 percent of the time because of single-stream FIFO kernel submission and kernel-launch overhead.
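Switching at fixed time quanta can be sketched as a round-robin simulation. This is a toy model of the scheduling policy only, assuming hypothetical job names and work units; it does not model the cache or TLB effects it annotates.

```python
from collections import deque

def round_robin(jobs: dict[str, int], quantum: int) -> list[str]:
    """Simulate temporal sharing: each job holds the GPU for at most
    `quantum` work units, then is preempted and requeued."""
    queue = deque(jobs.items())  # (name, remaining work)
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        timeline.append(name)  # job occupies the GPU for this quantum
        if remaining - run > 0:
            # Context switch: requeue with the leftover work. Each switch
            # is where cache pollution and TLB churn would occur.
            queue.append((name, remaining - run))
    return timeline

# A long training job interleaved with a short task: the short task
# finishes after the second quantum instead of waiting behind the
# long job, which is the fairness benefit temporal sharing buys.
print(round_robin({"train": 6, "short": 2}, quantum=2))
# → ['train', 'short', 'train', 'train']
```

The latency-tail cost shows up in the same trace: the long job's completion is delayed by every quantum granted to others, and that delay varies with whatever else is queued.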
Choosing the Right Model
Use spatial isolation with MIG for multi-tenant inference serving with strict SLOs under steady load; use temporal sharing for exploratory training, hyperparameter optimization, and short tasks where fairness matters more than per-iteration latency.
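The guidance above reduces to a two-input decision rule, sketched here as a toy helper. The function, its parameters, and the return labels are illustrative assumptions, not a production placement policy.

```python
def choose_sharing_model(strict_slo: bool, steady_load: bool) -> str:
    """Toy decision rule mirroring the text: spatial isolation only when
    both strict latency SLOs and steady load justify dedicated slices."""
    if strict_slo and steady_load:
        return "spatial (MIG)"
    # Bursty, exploratory, or fairness-sensitive work: share in time.
    return "temporal"

print(choose_sharing_model(strict_slo=True, steady_load=True))    # → spatial (MIG)
print(choose_sharing_model(strict_slo=False, steady_load=False))  # → temporal
```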