
GPU Autoscaling: Multi-Layer Control Architecture

GPU autoscaling is fundamentally different from CPU autoscaling because it operates across three distinct layers that must coordinate. The pod layer scales application replicas based on demand signals. The node layer scales the underlying GPU capacity by adding or removing GPU-equipped nodes. The GPU allocation layer decides whether workloads get full exclusive devices or fractional slices through virtualization.

Each layer operates on a different timescale and uses different signals. Pod scaling reacts within seconds to minutes using metrics like queue depth, request concurrency, or p95 latency. Node scaling takes longer, typically 180+ seconds for node spin-up plus GPU driver initialization plus model weight loading (which can be gigabytes). GPU allocation decisions happen at scheduling time, determining whether a small model gets 1/4 of a GPU through Multi-Instance GPU (MIG) or whether a large model needs an exclusive Tesla V100.

The coordination challenge is critical. If your pod autoscaler scales up 10 new replicas but your cluster autoscaler hasn't added GPU nodes, those pods sit pending indefinitely. If you allocate fractional GPUs too aggressively, memory fragmentation causes out-of-memory errors even though the cluster appears to have free capacity. Production systems use smoothed multi-minute averages (such as 3-minute windows) combined with hysteresis to prevent oscillation, where you scale up on a spike and then immediately scale down when the batch completes.

Real implementations track multiple correlated signals because GPU "utilization" alone misleads. A GPU can show 20% Streaming Multiprocessor (SM) utilization while latency is terrible because the workload is memory-bandwidth bound or bottlenecked on PCIe transfers. Effective monitoring combines SM occupancy, GPU memory usage, memory bandwidth utilization, and end-to-end application latency to make informed scaling decisions.
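A minimal sketch of what such a multi-signal pod-layer decision could look like is shown below. The thresholds, the latency SLO, and the GpuSample/ScalingController names are illustrative assumptions, not any particular autoscaler's implementation; utilization values are treated as DCGM-style fractions normalized to 0–1.

```python
from collections import deque
from dataclasses import dataclass

# Illustrative thresholds (assumptions); real values come from load testing.
TARGET_UTIL = 0.40        # scale up above this smoothed utilization
SCALE_DOWN_UTIL = 0.25    # scale down only below this (hysteresis gap)
WINDOW_SECONDS = 180      # 3-minute smoothing window

@dataclass
class GpuSample:
    timestamp: float        # seconds since epoch
    sm_util: float          # 0.0-1.0 SM utilization
    p95_latency_ms: float   # end-to-end application latency
    queue_depth: int        # pending requests

class ScalingController:
    """Sketch of a pod-layer decision that combines several correlated signals
    over a smoothing window and applies hysteresis to avoid oscillation."""

    def __init__(self, latency_slo_ms: float = 500.0):
        self.samples: deque = deque()
        self.latency_slo_ms = latency_slo_ms

    def record(self, sample: GpuSample) -> None:
        # Keep only samples inside the smoothing window.
        self.samples.append(sample)
        cutoff = sample.timestamp - WINDOW_SECONDS
        while self.samples and self.samples[0].timestamp < cutoff:
            self.samples.popleft()

    def decide(self) -> str:
        if not self.samples:
            return "hold"
        n = len(self.samples)
        avg_util = sum(s.sm_util for s in self.samples) / n
        avg_queue = sum(s.queue_depth for s in self.samples) / n
        worst_p95 = max(s.p95_latency_ms for s in self.samples)

        # Scale up if *any* signal says we are falling behind: low SM
        # utilization does not guarantee low latency.
        if avg_util > TARGET_UTIL or worst_p95 > self.latency_slo_ms or avg_queue > 10:
            return "scale_up"
        # Scale down only when every signal is comfortably below the lower
        # threshold for the whole window (hysteresis).
        if avg_util < SCALE_DOWN_UTIL and worst_p95 < 0.5 * self.latency_slo_ms and avg_queue == 0:
            return "scale_down"
        return "hold"
```

The asymmetric thresholds (scale up above 40%, scale down only below 25%) form the hysteresis gap that keeps a completed batch from immediately triggering a scale-down.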
💡 Key Takeaways
Pod layer scales replicas using queue depth, latency percentiles, or GPU utilization with 3-minute smoothing windows to prevent oscillation from batch workload spikes
Node layer adds GPU capacity with a 180+ second cold start (node provisioning plus driver init plus loading model weights measured in gigabytes), requiring predictive warming for SLO compliance
GPU allocation layer chooses full-device isolation for memory-intensive models versus fractional MIG slices for small workloads to improve bin packing and utilization (see the allocation sketch after this list)
Single-metric autoscaling fails because 20% SM utilization can coexist with high latency when bottlenecked on memory bandwidth, PCIe transfers, or CPU preprocessing
Coordination across layers prevents placement deadlocks where pods scale up but cannot schedule because the cluster autoscaler did not add matching GPU node types with correct taints
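As referenced in the allocation takeaway above, the choice between a fractional slice and a full device often reduces to a memory-fit check with headroom. The sketch below is a simplified illustration; the slice sizes, headroom factor, and the choose_allocation helper are assumptions, not a real scheduler API.

```python
# Hypothetical MIG slice sizes (GB of GPU memory) on a 40 GB device; real
# profiles depend on the GPU model and the configured MIG geometry.
MIG_PROFILES_GB = [5, 10, 20]
FULL_DEVICE_GB = 40

def choose_allocation(model_mem_gb: float, headroom: float = 1.2) -> str:
    """Pick the smallest allocation whose memory comfortably fits the model.

    `headroom` reserves room for activations/KV cache so a slice that is
    nominally large enough does not OOM under load (an assumed factor).
    """
    needed = model_mem_gb * headroom
    for slice_gb in MIG_PROFILES_GB:
        if needed <= slice_gb:
            return f"mig-slice-{slice_gb}gb"
    if needed <= FULL_DEVICE_GB:
        return "full-gpu"
    return "needs-multi-gpu-or-larger-device"

# Example: a 7 GB model lands on a 10 GB slice; a 30 GB model needs a full GPU.
print(choose_allocation(7.0))   # mig-slice-10gb
print(choose_allocation(30.0))  # full-gpu
```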
📌 Examples
Production configuration: min 1 to max 4 replicas per model, scaling when per-pod GPU utilization exceeds a 40% target averaged over 3 minutes using NVIDIA Data Center GPU Manager (DCGM) metrics
Two GPU node groups: standard pool (max 5 nodes, on-demand for reliability) and large pool (max 3 nodes, spot capacity for cost), both with scale-to-zero when idle
Health check grace period of 180 seconds accommodates model weight loading; termination grace of 600 seconds allows in-flight inference to drain before shutdown (see the cold-start budget sketch below)
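As a rough way to see why the 180-second figure and predictive warming matter, the node-layer cold-start budget can be added up explicitly. Everything in the sketch below except the 180-second node spin-up number from the section is an assumed value for illustration.

```python
# Back-of-the-envelope cold-start budget for a new GPU node serving a model.
NODE_PROVISION_S = 180        # node spin-up figure quoted in the section
DRIVER_INIT_S = 30            # assumed GPU driver / device plugin init time
MODEL_WEIGHTS_GB = 14         # assumed model size
PULL_BANDWIDTH_GBPS = 1.0     # assumed sustained download bandwidth (GB/s)

def cold_start_seconds() -> float:
    weight_load_s = MODEL_WEIGHTS_GB / PULL_BANDWIDTH_GBPS
    return NODE_PROVISION_S + DRIVER_INIT_S + weight_load_s

def needs_predictive_warming(scale_up_slo_s: float) -> bool:
    """If the full cold start exceeds how quickly capacity must appear,
    reactive node scaling alone cannot meet the SLO and nodes (or at least
    images/weights) must be warmed ahead of demand."""
    return cold_start_seconds() > scale_up_slo_s

print(f"cold start ~ {cold_start_seconds():.0f} s")   # ~224 s with these assumptions
print(needs_predictive_warming(scale_up_slo_s=120))   # True
```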