Definition
GPU autoscaling operates across three distinct layers that must coordinate: the pod layer (scaling application replicas), the node layer (scaling underlying GPU capacity), and the GPU allocation layer (deciding whether workloads get full devices or fractional slices).
Different Timescales
Each layer operates on a different timescale and consumes different signals. Pod scaling reacts within seconds to minutes using metrics like queue depth, request concurrency, or p95 latency. Node scaling takes far longer, typically 180+ seconds for node spin-up, plus GPU driver initialization, plus model weight loading (weights can run to many gigabytes). GPU allocation decisions happen at scheduling time, determining whether a small model gets a fractional slice of a GPU through MIG or whether a large model needs an exclusive, whole device such as a V100.
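The node-layer cold-start budget described above is simple arithmetic worth making explicit; the function name and every duration and bandwidth value below are illustrative assumptions, not measurements:

```python
def node_cold_start_s(provision_s=180.0, driver_init_s=30.0,
                      weights_gb=14.0, load_gbps=1.0):
    """Illustrative node-layer cold-start estimate (all defaults assumed):
    node provisioning, then GPU driver/runtime initialization, then model
    weight loading at the bandwidth you actually observe from storage or
    the network, not the interconnect's spec-sheet number."""
    weight_load_s = weights_gb / load_gbps
    return provision_s + driver_init_s + weight_load_s
```

Even with optimistic defaults this lands well past the reaction time of pod-level scaling, which is why node capacity has to be warmed predictively rather than on demand.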
The Coordination Challenge
If your pod autoscaler scales up 10 new replicas but your cluster autoscaler hasn't added GPU nodes, those pods sit pending indefinitely. If you allocate fractional GPUs too aggressively, memory fragmentation causes OOM errors even though aggregate capacity looks free. Production systems therefore use smoothed multi-minute averages (e.g., 3-minute windows) combined with hysteresis to prevent oscillation, where a batch spike triggers a scale-up that is immediately reversed when the batch completes.
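A minimal sketch of that smoothing-plus-hysteresis pattern follows; the window size, thresholds, and cooldown are illustrative assumptions, and the class is hypothetical rather than any library's API:

```python
from collections import deque

class SmoothedHysteresisScaler:
    """Sketch of pod-scaling logic: a sliding 3-minute average plus
    hysteresis (separate up/down thresholds and a cooldown) so that a
    short batch spike does not trigger a scale-up that is immediately
    reversed. All thresholds are illustrative."""

    def __init__(self, window_s=180, sample_period_s=15,
                 up_threshold=0.40, down_threshold=0.25, cooldown_s=120):
        self.samples = deque(maxlen=window_s // sample_period_s)
        self.up = up_threshold
        self.down = down_threshold
        self.cooldown_s = cooldown_s
        self.since_last_action_s = cooldown_s  # permit an immediate first action

    def observe(self, gpu_util, elapsed_s=15):
        """Record one utilization sample and return a scaling decision."""
        self.samples.append(gpu_util)
        self.since_last_action_s += elapsed_s
        avg = sum(self.samples) / len(self.samples)
        if self.since_last_action_s < self.cooldown_s:
            return "hold"           # still inside the cooldown window
        if avg > self.up:
            self.since_last_action_s = 0
            return "scale_up"
        if avg < self.down:
            self.since_last_action_s = 0
            return "scale_down"
        return "hold"
```

The asymmetric thresholds (scale up above 40%, down only below 25%) are what prevent flapping when utilization hovers near a single cut-off.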
Why Utilization Alone Misleads
A GPU can show 20% SM utilization while latency is terrible because the workload is memory-bandwidth bound or bottlenecked on PCIe transfers. Effective monitoring combines SM occupancy, GPU memory usage, and memory bandwidth utilization with end-to-end application latency to make informed scaling decisions. Relying on GPU utilization alone causes either over-provisioning (wasted cost) or under-provisioning (violated latency SLOs).
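A multi-signal decision along these lines might look like the following sketch; the function name, threshold values, and SLO are assumptions chosen for illustration:

```python
def scaling_signal(sm_util, mem_bw_util, gpu_mem_used_frac, p95_latency_ms,
                   latency_slo_ms=200.0):
    """Illustrative multi-signal scaling decision (all thresholds assumed).
    Returns 'scale_up', 'scale_down', or 'hold'."""
    # A latency SLO violation wins even at low SM utilization: a
    # memory-bandwidth-bound or PCIe-bound workload can be slow at 20% SM.
    if p95_latency_ms > latency_slo_ms:
        return "scale_up"
    # Saturation on any single resource dimension also warrants scaling up.
    if sm_util > 0.80 or mem_bw_util > 0.80 or gpu_mem_used_frac > 0.90:
        return "scale_up"
    # Scale down only when every dimension is quiet and latency is healthy.
    if (sm_util < 0.20 and mem_bw_util < 0.20
            and p95_latency_ms < 0.5 * latency_slo_ms):
        return "scale_down"
    return "hold"
```

Note how a utilization-only policy would scale this workload down in exactly the memory-bandwidth-bound case where it most needs more replicas.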
✓ Pod layer scales replicas using queue depth, latency percentiles, or GPU utilization, with 3-minute smoothing windows to prevent oscillation from batch workload spikes
✓ Node layer adds GPU capacity with a 180+ second cold start (node provisioning plus driver init plus model load time, with weights measured in gigabytes), requiring predictive warming for SLO compliance
✓ GPU allocation layer chooses full-device isolation for memory-intensive models versus fractional MIG slices for small workloads to improve bin packing and utilization
✓ Single-metric autoscaling fails because 20% SM utilization can coexist with high latency when bottlenecked on memory bandwidth, PCIe transfers, or CPU preprocessing
✓ Coordination across layers prevents placement deadlocks where pods scale up but cannot schedule because the cluster autoscaler did not add matching GPU node types with correct taints
1. Production configuration: min 1 to max 4 replicas per model, scaling when per-pod GPU utilization exceeds a 40% target averaged over 3 minutes, using NVIDIA Data Center GPU Manager (DCGM) metrics
2. Two GPU node groups: a standard pool (max 5 nodes, on-demand for reliability) and a large pool (max 3 nodes, spot capacity for cost), both with scale-to-zero when idle
3. A health-check grace period of 180 seconds accommodates model weight loading; a termination grace of 600 seconds allows in-flight inference to drain before shutdown
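The replica bounds and 40% utilization target in item 1 follow the standard HPA scaling rule, desired = ceil(current × observed / target), clamped to the configured range. A minimal sketch (the helper name is hypothetical; the observed value would come from DCGM metrics averaged over the 3-minute window):

```python
import math

def desired_replicas(current_replicas, avg_gpu_util, target_util=0.40,
                     min_replicas=1, max_replicas=4):
    """HPA-style replica calculation for the configuration above:
    ceil(current * observed / target), clamped to [min, max].
    avg_gpu_util is the per-pod utilization averaged over the window."""
    desired = math.ceil(current_replicas * avg_gpu_util / target_util)
    return max(min_replicas, min(max_replicas, desired))
```

At 2 replicas running 80% utilization this doubles to the max of 4; at 10% utilization it collapses to the floor of 1, which is why the smoothing window matters before feeding utilization into this formula.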