
GPU Partitioning Patterns: Whole Device vs Time Slicing vs Hardware Partitioning

Three core patterns exist for partitioning GPU resources, each with distinct trade-offs for utilization, isolation, and performance predictability.

Whole-device allocation gives a container exclusive control of an entire GPU. This is simple to implement, provides completely predictable performance, and eliminates interference. However, it wastes capacity when workloads use less than 20 percent of GPU compute or memory: a small vision model optimized with TensorRT might consume only 2 GB of an 80 GB A100, leaving 78 GB idle.

Time slicing multiplexes GPU kernels from multiple processes onto one device, either by context switching between them or by running them concurrently through Multi-Process Service (MPS). This can double or triple utilization for bursty inference workloads by filling idle cycles. The downside is interference risk: when one tenant launches heavy kernels that saturate streaming multiprocessors or memory bandwidth, other tenants see 2x to 5x p99 latency spikes. For hard real-time inference with SLOs under 100 milliseconds, this unpredictability is unacceptable.

Hardware partitioning creates isolated instances on a single physical card. NVIDIA's Multi-Instance GPU (MIG) on A100 or H100 divides one GPU into up to seven instances, each with dedicated memory slices and compute resources. A single A100 can be partitioned into seven 1g.5gb instances, each providing 5 GB of memory and roughly one seventh of the compute. This delivers true Quality of Service (QoS) guarantees, memory protection, and fault containment: one instance crashing does not affect the others. The limitation is fixed slice sizes; if most jobs need 10 GB, the 5 GB slices create fragmentation and stranded capacity.
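As a concrete illustration (a sketch, not taken from the original text), the snippet below uses the Kubernetes Python client to show how these patterns typically surface to workloads when the NVIDIA device plugin is installed: a whole GPU is requested as nvidia.com/gpu, a MIG slice as a profile-specific resource such as nvidia.com/mig-1g.5gb (the device plugin's mixed MIG strategy), and time slicing is enabled in the device plugin's configuration while pods still request nvidia.com/gpu. The namespace and image names are placeholders.

```python
# Sketch: requesting whole-GPU vs. MIG-slice resources with the Kubernetes
# Python client. Assumes the NVIDIA device plugin is installed; with the MIG
# "mixed" strategy, slices appear as resources like nvidia.com/mig-1g.5gb.
# The namespace and images are placeholders, not from the original text.
from kubernetes import client, config


def gpu_pod(name: str, image: str, resource_name: str) -> client.V1Pod:
    """Build a pod that requests one unit of the given GPU resource."""
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            limits={resource_name: "1"}  # GPU resources are requested via limits
        ),
    )
    spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    return client.V1Pod(metadata=client.V1ObjectMeta(name=name), spec=spec)


if __name__ == "__main__":
    config.load_kube_config()
    api = client.CoreV1Api()

    # Whole-device allocation: exclusive use of one physical GPU.
    api.create_namespaced_pod(
        namespace="ml-serving",
        body=gpu_pod("trainer", "my-registry/trainer:latest", "nvidia.com/gpu"),
    )

    # Hardware partitioning: one 1g.5gb MIG slice (5 GB, ~1/7 of compute).
    api.create_namespaced_pod(
        namespace="ml-serving",
        body=gpu_pod("vision-infer", "my-registry/vision:latest", "nvidia.com/mig-1g.5gb"),
    )

    # Time slicing is configured in the device plugin (replicas per GPU);
    # pods still request nvidia.com/gpu, but several share the same device.
```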
💡 Key Takeaways
Whole-GPU allocation is best for high-utilization training (typically above a 70 percent GPU duty cycle) and strict-SLO inference where predictability matters more than packing density. Use it when debugging performance issues or running production models with tight latency budgets.
Time slicing works well for many small models with soft latency SLOs, such as batch processing pipelines or development environments. It can raise utilization from around 20 percent to 60 percent or higher, but requires careful workload classification to avoid mixing real-time and batch traffic.
Multi-Instance GPU (MIG) delivers true isolation with per-slice memory protection and separate fault domains. An A100 in the 1g.5gb profile provides seven instances, each handling 100 to 300 images per second at batch size 8 with p99 latency around 50 to 80 milliseconds for optimized vision models.
MIG slice configurations come from a fixed menu, not arbitrary sizes. Common profiles include 1g.5gb (7 slices), 2g.10gb (3 slices), 3g.20gb (2 slices), and 7g.40gb (1 whole GPU). Misalignment between job sizes and the available profiles causes fragmentation; the packing sketch after this list shows how quickly memory gets stranded.
Virtual GPU (vGPU) at the hypervisor layer enables strong isolation across tenants and VM-based security domains, but adds licensing costs (often $1,000 to $2,000 per GPU annually) and 5 to 10 percent overhead compared to container-level sharing.
Choose whole GPUs for training and strict-SLO inference. Choose MIG for multi-tenant isolation with predictable performance. Choose time slicing only for soft-SLO workloads where 2x to 5x tail-latency variance is acceptable.
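To make the fragmentation point concrete, here is a minimal packing sketch in plain Python. The job sizes and the smallest-fitting-profile heuristic are illustrative assumptions, not from the original text; the profile memory sizes match the A100 40GB menu listed above.

```python
# Sketch: estimating stranded memory when job sizes don't align with the
# fixed MIG profile menu of an A100 40GB. Job sizes below are illustrative.

# Profile -> slice memory in GB (slices per GPU: 1g.5gb x7, 2g.10gb x3,
# 3g.20gb x2, 7g.40gb x1).
PROFILE_GB = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20, "7g.40gb": 40}


def smallest_fitting_profile(job_gb: float) -> str:
    """Return the smallest profile whose memory covers the job."""
    for name, mem_gb in sorted(PROFILE_GB.items(), key=lambda kv: kv[1]):
        if job_gb <= mem_gb:
            return name
    raise ValueError(f"{job_gb} GB exceeds the largest single profile")


def stranded_memory_gb(job_sizes_gb: list[float]) -> float:
    """Memory reserved by the chosen slices but never used by the jobs."""
    return sum(PROFILE_GB[smallest_fitting_profile(j)] - j for j in job_sizes_gb)


if __name__ == "__main__":
    # Ten jobs needing 7 GB each: too large for 1g.5gb, so each occupies a
    # 2g.10gb slice and strands 3 GB, i.e. 30 GB of stranded capacity total.
    print(stranded_memory_gb([7.0] * 10))  # -> 30.0
```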
📌 Examples
A Triton Inference Server deployment uses MIG slices in the 1g.5gb profile to serve 20 different models in isolation. Each model gets dedicated memory and compute, preventing one model's traffic spike from affecting the others. Total throughput reaches 2,000 to 6,000 requests per second per A100, depending on model complexity.
An AI platform runs lightweight classification models on time-sliced GPUs, achieving 4x higher pod density. When a batch job saturates the GPU, real-time inference latency jumps from 40 milliseconds to 180 milliseconds, violating SLOs. The team moves real-time workloads to dedicated MIG slices; a toy version of this placement policy is sketched after these examples.
A large-scale training job reserves 512 whole A100 GPUs using gang scheduling. Each GPU runs at 85 to 95 percent utilization for 12 hours. Sharing would reduce per-GPU throughput by 15 to 30 percent due to context-switching overhead, making whole-device allocation the only viable choice.
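The placement decision running through these examples and the takeaways above can be written down as a toy policy. The Workload fields and thresholds below are illustrative assumptions rather than a definitive implementation; a real scheduler would also weigh model size, multi-tenancy requirements, and which MIG profiles are actually available on the fleet.

```python
# Sketch: a toy placement policy encoding the guidance in this section.
# Thresholds and workload fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    kind: str                    # "training" or "inference"
    p99_slo_ms: Optional[float]  # None means no hard latency SLO
    expected_gpu_util: float     # estimated duty cycle, 0.0 - 1.0


def placement(w: Workload) -> str:
    """Return 'whole-gpu', 'mig-slice', or 'time-sliced' for a workload."""
    if w.kind == "training" or w.expected_gpu_util > 0.7:
        return "whole-gpu"    # high duty cycle: sharing only adds overhead
    if w.p99_slo_ms is not None and w.p99_slo_ms < 100:
        return "mig-slice"    # strict SLO: needs isolation, not best effort
    return "time-sliced"      # soft SLO: pack for utilization


if __name__ == "__main__":
    print(placement(Workload("inference", p99_slo_ms=50, expected_gpu_util=0.2)))    # mig-slice
    print(placement(Workload("inference", p99_slo_ms=None, expected_gpu_util=0.1)))  # time-sliced
    print(placement(Workload("training", p99_slo_ms=None, expected_gpu_util=0.9)))   # whole-gpu
```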