GPU Autoscaling Failure Modes: Oscillation, Placement, and Hidden Bottlenecks

GPU autoscaling fails in subtle ways that cause production incidents even when the configuration looks correct. The failure modes stem from the complex interaction between GPU hardware characteristics, Kubernetes scheduling constraints, metric collection delays, and multi-layer coordination. Understanding these edge cases prevents outages and performance degradation.

Metric-induced oscillation occurs when instantaneous GPU utilization drives scaling decisions. A batch inference workload processes 100 requests together, causing GPU utilization to spike to 95% for 30 seconds and then drop to 5% while waiting for the next batch. Naive autoscaling sees the 95% spike and adds two replicas. By the time the new replicas start (after a 240-second cold start), the original batch has completed and utilization reads 5%, triggering an immediate scale-down. This thrashing wastes money and destabilizes the system. Production solutions use 3-to-5-minute rolling averages combined with asymmetric hysteresis: require 70% utilization sustained for 2 minutes before scaling up, but 30% sustained for 5 minutes before scaling down.

Placement deadlocks trap GPU pods in a pending state indefinitely despite autoscaling being "enabled." The cluster autoscaler only adds nodes that can satisfy pending pod constraints. If your GPU pod requires a node selector label or toleration that no node group provides, the autoscaler never adds capacity. Similarly, if GPU nodes lack proper taints, non-GPU workloads can consume them, leaving GPU pods unscheduled. The symptom is pending pods with "Insufficient nvidia.com/gpu" events while the cluster autoscaler logs show "no available node groups can satisfy this pod."

Hidden bottlenecks cause high latency despite low GPU utilization. A real production case saw p99 inference latency of 800ms with only 25% SM utilization. Investigation revealed the bottleneck was CPU preprocessing (tokenization and feature extraction taking 500ms) ahead of GPU inference (taking 100ms), so the autoscaler added GPU replicas uselessly because GPU capacity was not the constraint. Similarly, memory bandwidth saturation shows normal SM utilization but terrible throughput, and PCIe bottlenecks between CPU and GPU memory show low GPU utilization while data transfers dominate the request time. The fix requires monitoring correlated metrics (SM occupancy, memory bandwidth, end-to-end latency, and CPU utilization together) to identify the true constraint.
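The asymmetric hysteresis described above can be expressed as a small control loop. The sketch below is a minimal, hypothetical illustration in Python: the metric source behind sample_fn (for example a DCGM or Prometheus query) and the exact thresholds are assumptions taken from the numbers in this section, not a production implementation.

```python
import time
from collections import deque

class HysteresisScaler:
    """Scale-decision sketch: rolling-average GPU utilization plus
    asymmetric sustain windows (fast scale-up, slow scale-down)."""

    def __init__(self, sample_fn, window_s=300,
                 up_threshold=70.0, up_sustain_s=120,
                 down_threshold=30.0, down_sustain_s=300):
        self.sample_fn = sample_fn      # hypothetical: returns current GPU utilization in %
        self.window_s = window_s        # rolling-average window (5 minutes)
        self.up_threshold = up_threshold
        self.up_sustain_s = up_sustain_s
        self.down_threshold = down_threshold
        self.down_sustain_s = down_sustain_s
        self.samples = deque()          # (timestamp, utilization) pairs
        self.above_since = None
        self.below_since = None

    def _rolling_avg(self, now):
        # Drop samples older than the window, then average what remains.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return sum(u for _, u in self.samples) / len(self.samples)

    def decide(self):
        """Return 'scale_up', 'scale_down', or 'hold' for this tick."""
        now = time.monotonic()
        self.samples.append((now, self.sample_fn()))
        avg = self._rolling_avg(now)

        # Track how long the smoothed signal has stayed above/below the thresholds.
        self.above_since = (self.above_since or now) if avg >= self.up_threshold else None
        self.below_since = (self.below_since or now) if avg <= self.down_threshold else None

        # Scale up after 2 sustained minutes; scale down only after 5.
        if self.above_since and now - self.above_since >= self.up_sustain_s:
            return "scale_up"
        if self.below_since and now - self.below_since >= self.down_sustain_s:
            return "scale_down"
        return "hold"
```

The same idea maps onto HPA-style settings: feed the autoscaler a smoothed utilization signal instead of the instantaneous one, and give the scale-down side a longer stabilization window than the scale-up side.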
💡 Key Takeaways
Oscillation from instantaneous metrics causes thrashing: a batch workload spikes to 95% utilization for 30 seconds then drops to 5%, triggering a scale-up followed by an immediate scale-down after the 240-second cold start
Placement deadlocks occur when GPU pod constraints (such as a nodeSelector for A100s) do not match any cluster autoscaler node group configuration, leaving pods pending despite autoscaling being enabled (see the matching sketch after this list)
Hidden CPU bottlenecks cause 800ms p99 latency with only 25% GPU utilization because preprocessing takes 500ms before 100ms of GPU inference, making GPU scaling ineffective
Memory bandwidth saturation shows normal Streaming Multiprocessor (SM) occupancy but terrible throughput when large language model inference is memory-bound rather than compute-bound, requiring bandwidth-aware scaling signals
GPU memory fragmentation causes out-of-memory errors at allocation time despite apparent free capacity, because frequent model swaps or fractional allocations leave the free space non-contiguous
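The placement-deadlock takeaway comes down to a constraint-matching failure: the cluster autoscaler simulates the pending pod against each node group's template and only scales up if some template fits. The Python sketch below is a simplified, hypothetical model of that matching (real toleration and affinity semantics are richer); the pod and node-group values mirror the V100 example in the next section.

```python
def node_group_fits(pod, group):
    """Simplified check: does this node group's template satisfy the pod?"""
    # 1. Every nodeSelector label must be present with the same value.
    for key, val in pod["nodeSelector"].items():
        if group["labels"].get(key) != val:
            return False
    # 2. Every taint on the template must be tolerated by the pod.
    for taint in group["taints"]:
        if not any(tol.get("key") == taint["key"] for tol in pod["tolerations"]):
            return False
    # 3. The template must advertise enough of the extended GPU resource.
    return pod["gpu_request"] <= group["allocatable"].get("nvidia.com/gpu", 0)


pod = {
    "nodeSelector": {"gpu-type": "v100"},   # custom label set on the Deployment
    "tolerations": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
    "gpu_request": 1,
}

node_groups = [
    {   # GPU pool: carries the provider's accelerator label but not the custom gpu-type label
        "labels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"},
        "taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
        "allocatable": {"nvidia.com/gpu": 8},
    },
]

if not any(node_group_fits(pod, g) for g in node_groups):
    # No template matches, so the autoscaler never adds a node and the pod stays Pending.
    print("placement deadlock: no node group satisfies the pod's constraints")
```

The practical fix is to make the two sides agree: either add the custom label to the node group's template so new nodes carry it, or switch the pod's nodeSelector to the label the provider already applies.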
📌 Examples
Production incident: batch inference oscillated between 1 and 4 replicas every 6 minutes because of a 30-second batch processing pattern and an instantaneous utilization metric; fixed with a 5-minute rolling average and a 5-minute scale-down stabilization window
Placement deadlock: GPU pods stayed pending for 2 hours because the nodeSelector required the label gpu-type=v100, but the cluster autoscaler node groups only carried cloud.google.com/gke-accelerator=nvidia-tesla-v100 and never the custom gpu-type label
Performance debugging: p95 latency of 650ms at 30% GPU utilization turned out to be a PCIe bottleneck, with 500MB feature tensors taking 400ms per request to transfer from CPU to GPU memory; fixed by moving preprocessing onto the GPU (see the diagnostic sketch after this list)
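The debugging pattern behind these examples is to read the correlated metrics together before trusting GPU utilization on its own. The sketch below is a hypothetical classifier over an assumed metric dictionary (sm_util, mem_bw_util, cpu_util, and per-request preprocess/transfer/GPU latency shares); the field names and thresholds are illustrative, not from any specific monitoring stack.

```python
def classify_bottleneck(m):
    """Pick the most likely constraint from correlated metrics (illustrative thresholds)."""
    if m["cpu_util"] > 0.85 and m["preprocess_ms"] > m["gpu_ms"]:
        return "cpu_preprocessing"      # tokenization / feature extraction dominates
    if m["transfer_ms"] > m["gpu_ms"]:
        return "pcie_transfer"          # host-to-device copies dominate the request
    if m["mem_bw_util"] > 0.90 and m["sm_util"] < 0.50:
        return "gpu_memory_bandwidth"   # memory-bound inference, not compute-bound
    if m["sm_util"] > 0.80:
        return "gpu_compute"            # only this case is fixed by adding GPU replicas
    return "unknown"


# Roughly the 650 ms / 30% utilization case above: transfer time dominates.
print(classify_bottleneck({
    "cpu_util": 0.40, "sm_util": 0.30, "mem_bw_util": 0.35,
    "preprocess_ms": 100, "transfer_ms": 400, "gpu_ms": 150,
}))  # -> "pcie_transfer"
```

Only when the classification lands on GPU compute does adding replicas address the latency; the other cases call for faster preprocessing, moving feature pipelines onto the GPU, or bandwidth-aware scaling signals instead.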