
GPU Partitioning Patterns: Whole Device vs Time Slicing vs Hardware Partitioning

Three core patterns exist for partitioning GPU resources, each with distinct trade-offs for utilization, isolation, and performance predictability.

Whole-device allocation gives a container exclusive control of an entire GPU. This is simple to implement, provides completely predictable performance, and eliminates interference. However, it wastes capacity when workloads use less than 20 percent of GPU compute or memory: a small vision model optimized with TensorRT might consume only 2 GB of an 80 GB A100, leaving 78 GB idle.

Time slicing multiplexes GPU kernels from multiple processes onto one device, either by context switching between them or by running them concurrently through Multi-Process Service (MPS). This can double or triple utilization for bursty inference workloads by filling idle cycles. The downside is interference risk: when one tenant launches heavy kernels that saturate streaming multiprocessors or memory bandwidth, other tenants see 2x to 5x p99 latency spikes. For hard real-time inference with SLOs under 100 milliseconds, this unpredictability is unacceptable.

Hardware partitioning creates isolated instances on a single physical card. NVIDIA's Multi-Instance GPU (MIG) on A100 or H100 divides one GPU into up to seven instances, each with dedicated memory slices and compute resources. A single A100 can be partitioned into seven 1g.5gb instances, each providing 5 GB of memory and roughly one seventh of the compute. This delivers true Quality of Service (QoS) guarantees, memory protection, and fault containment: one instance crashing does not affect the others. The limitation is fixed slice sizes; if most jobs need 10 GB, the 5 GB slices create fragmentation and stranded capacity.
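As a concrete illustration (a sketch, not taken from the original text), the snippet below uses the Kubernetes Python client to show how these patterns typically surface to workloads when the NVIDIA device plugin is installed: a whole GPU is requested as nvidia.com/gpu, a MIG slice as a profile-specific resource such as nvidia.com/mig-1g.5gb (the device plugin's mixed MIG strategy), and time slicing is enabled in the device plugin's configuration while pods still request nvidia.com/gpu. The namespace and image names are placeholders.

```python
# Sketch: requesting whole-GPU vs. MIG-slice resources with the Kubernetes
# Python client. Assumes the NVIDIA device plugin is installed; with the MIG
# "mixed" strategy, slices appear as resources like nvidia.com/mig-1g.5gb.
# The namespace and images are placeholders, not from the original text.
from kubernetes import client, config


def gpu_pod(name: str, image: str, resource_name: str) -> client.V1Pod:
    """Build a pod that requests one unit of the given GPU resource."""
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            limits={resource_name: "1"}  # GPU resources are requested via limits
        ),
    )
    spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    return client.V1Pod(metadata=client.V1ObjectMeta(name=name), spec=spec)


if __name__ == "__main__":
    config.load_kube_config()
    api = client.CoreV1Api()

    # Whole-device allocation: exclusive use of one physical GPU.
    api.create_namespaced_pod(
        namespace="ml-serving",
        body=gpu_pod("trainer", "my-registry/trainer:latest", "nvidia.com/gpu"),
    )

    # Hardware partitioning: one 1g.5gb MIG slice (5 GB, ~1/7 of compute).
    api.create_namespaced_pod(
        namespace="ml-serving",
        body=gpu_pod("vision-infer", "my-registry/vision:latest", "nvidia.com/mig-1g.5gb"),
    )

    # Time slicing is configured in the device plugin (replicas per GPU);
    # pods still request nvidia.com/gpu, but several share the same device.
```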
💡 Key Takeaways
Whole-GPU allocation is best for high-utilization training (typically above a 70 percent GPU duty cycle) and strict-SLO inference where predictability matters more than packing density. Use it when debugging performance issues or running production models with tight latency budgets.
Time slicing works well for many small models with soft latency SLOs, such as batch processing pipelines or development environments. It can raise utilization from around 20 percent to 60 percent or higher, but requires careful workload classification to avoid mixing real-time and batch traffic.
Multi-Instance GPU (MIG) delivers true isolation with per-slice memory protection and separate fault domains. An A100 in the 1g.5gb profile provides seven instances, each handling 100 to 300 images per second at batch size 8 with p99 latency around 50 to 80 milliseconds for optimized vision models.
MIG slice configurations come from a fixed menu, not arbitrary sizes. Common profiles include 1g.5gb (7 slices), 2g.10gb (3 slices), 3g.20gb (2 slices), and 7g.40gb (1 whole GPU). Misalignment between job sizes and the available profiles causes fragmentation; the packing sketch after this list shows how quickly memory gets stranded.
Virtual GPU (vGPU) at the hypervisor layer enables strong isolation across tenants and VM-based security domains, but adds licensing costs (often $1,000 to $2,000 per GPU annually) and 5 to 10 percent overhead compared to container-level sharing.
Choose whole GPUs for training and strict-SLO inference. Choose MIG for multi-tenant isolation with predictable performance. Choose time slicing only for soft-SLO workloads where 2x to 5x tail-latency variance is acceptable.
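To make the fragmentation point concrete, here is a minimal packing sketch in plain Python. The job sizes and the smallest-fitting-profile heuristic are illustrative assumptions, not from the original text; the profile memory sizes match the A100 40GB menu listed above.

```python
# Sketch: estimating stranded memory when job sizes don't align with the
# fixed MIG profile menu of an A100 40GB. Job sizes below are illustrative.

# Profile -> slice memory in GB (slices per GPU: 1g.5gb x7, 2g.10gb x3,
# 3g.20gb x2, 7g.40gb x1).
PROFILE_GB = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20, "7g.40gb": 40}


def smallest_fitting_profile(job_gb: float) -> str:
    """Return the smallest profile whose memory covers the job."""
    for name, mem_gb in sorted(PROFILE_GB.items(), key=lambda kv: kv[1]):
        if job_gb <= mem_gb:
            return name
    raise ValueError(f"{job_gb} GB exceeds the largest single profile")


def stranded_memory_gb(job_sizes_gb: list[float]) -> float:
    """Memory reserved by the chosen slices but never used by the jobs."""
    return sum(PROFILE_GB[smallest_fitting_profile(j)] - j for j in job_sizes_gb)


if __name__ == "__main__":
    # Ten jobs needing 7 GB each: too large for 1g.5gb, so each occupies a
    # 2g.10gb slice and strands 3 GB, i.e. 30 GB of stranded capacity total.
    print(stranded_memory_gb([7.0] * 10))  # -> 30.0
```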
📌 Examples
A Triton Inference Server deployment uses MIG slices in the 1g.5gb profile to serve 20 different models in isolation. Each model gets dedicated memory and compute, preventing one model's traffic spike from affecting the others. Total throughput reaches 2,000 to 6,000 requests per second per A100, depending on model complexity.
An AI platform runs lightweight classification models on time-sliced GPUs, achieving 4x higher pod density. When a batch job saturates the GPU, real-time inference latency jumps from 40 milliseconds to 180 milliseconds, violating SLOs. The team moves real-time workloads to dedicated MIG slices; a toy version of this placement policy is sketched after these examples.
A large-scale training job reserves 512 whole A100 GPUs using gang scheduling. Each GPU runs at 85 to 95 percent utilization for 12 hours. Sharing would reduce per-GPU throughput by 15 to 30 percent due to context-switching overhead, making whole-device allocation the only viable choice.
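The placement decision running through these examples and the takeaways above can be written down as a toy policy. The Workload fields and thresholds below are illustrative assumptions rather than a definitive implementation; a real scheduler would also weigh model size, multi-tenancy requirements, and which MIG profiles are actually available on the fleet.

```python
# Sketch: a toy placement policy encoding the guidance in this section.
# Thresholds and workload fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    kind: str                    # "training" or "inference"
    p99_slo_ms: Optional[float]  # None means no hard latency SLO
    expected_gpu_util: float     # estimated duty cycle, 0.0 - 1.0


def placement(w: Workload) -> str:
    """Return 'whole-gpu', 'mig-slice', or 'time-sliced' for a workload."""
    if w.kind == "training" or w.expected_gpu_util > 0.7:
        return "whole-gpu"    # high duty cycle: sharing only adds overhead
    if w.p99_slo_ms is not None and w.p99_slo_ms < 100:
        return "mig-slice"    # strict SLO: needs isolation, not best effort
    return "time-sliced"      # soft SLO: pack for utilization


if __name__ == "__main__":
    print(placement(Workload("inference", p99_slo_ms=50, expected_gpu_util=0.2)))    # mig-slice
    print(placement(Workload("inference", p99_slo_ms=None, expected_gpu_util=0.1)))  # time-sliced
    print(placement(Workload("training", p99_slo_ms=None, expected_gpu_util=0.9)))   # whole-gpu
```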