GPU Autoscaling Failure Modes: Oscillation, Placement, and Hidden Bottlenecks
Metric-Induced Oscillation
Occurs when instantaneous GPU utilization drives scaling decisions. A batch inference workload processes 100 requests together, causing GPU utilization to spike to 95% for 30 seconds, then drop to 5% while waiting for the next batch. Naive autoscaling sees the 95% spike and adds two replicas. By the time the new replicas start (after a 240-second cold start), the original batch has completed and utilization reads 5%, triggering an immediate scale-down. This thrashing wastes money and destabilizes the system. Production solutions use 3-to-5-minute rolling averages combined with asymmetric hysteresis.
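The smoothing-plus-hysteresis idea can be sketched as follows. This is an illustrative controller, not any autoscaler's actual algorithm; the window length, thresholds, and sample interval are assumptions chosen to match the scenario above (a 5-minute window at a 15-second scrape interval), and scale-down additionally requires a sustained run of low readings.

```python
from collections import deque

class SmoothedScaler:
    """Illustrative scaling decision: rolling-average utilization plus
    asymmetric hysteresis (react quickly to sustained load, slowly to
    idleness). All thresholds here are assumptions, not vendor defaults."""

    def __init__(self, window_samples=20, up_threshold=0.70,
                 down_threshold=0.30, down_stable_samples=10):
        self.samples = deque(maxlen=window_samples)  # e.g. 20 x 15s = 5 min
        self.up_threshold = up_threshold
        self.down_threshold = down_threshold
        self.down_stable_samples = down_stable_samples
        self.below_count = 0  # consecutive low rolling averages

    def decide(self, utilization):
        self.samples.append(utilization)
        avg = sum(self.samples) / len(self.samples)
        if avg > self.up_threshold:
            self.below_count = 0
            return "scale_up"
        if avg < self.down_threshold:
            # asymmetric side: demand a sustained quiet period, so one
            # inter-batch lull cannot trigger a scale-down
            self.below_count += 1
            if self.below_count >= self.down_stable_samples:
                return "scale_down"
        else:
            self.below_count = 0
        return "hold"
```

With this in place, a single 30-second spike to 95% barely moves the 5-minute average, so no replicas are added, and the 5% inter-batch troughs must persist for minutes before a scale-down fires.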
Placement Deadlocks
Trap GPU pods in a Pending state indefinitely despite autoscaling being enabled. The cluster autoscaler only adds nodes that can satisfy a pending pod's constraints. If your GPU pod requires a node-selector label or toleration that no node group provides, the autoscaler never adds capacity. Similarly, if GPU nodes lack proper taints, non-GPU workloads can consume them, leaving GPU pods unscheduled. The symptom is Pending pods with "Insufficient nvidia.com/gpu" events while the cluster autoscaler logs show "no available node groups can satisfy this pod."
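The feasibility check at the heart of the deadlock can be sketched as a pure function: a node group is a candidate only if its labels satisfy the pod's node selector and the pod tolerates the group's taints. The dict shapes below are simplified stand-ins for the real Kubernetes API objects, and the label key and taint names are illustrative assumptions.

```python
def schedulable_node_groups(pod, node_groups):
    """Sketch of the autoscaler's feasibility check over node groups.
    A group qualifies only if (a) its labels satisfy the pod's
    nodeSelector and (b) the pod tolerates every taint on the group.
    Plain dicts stand in for real Kubernetes API objects."""
    feasible = []
    for group in node_groups:
        labels_ok = all(group["labels"].get(key) == value
                        for key, value in pod.get("nodeSelector", {}).items())
        taints_ok = all(taint in pod.get("tolerations", [])
                        for taint in group.get("taints", []))
        if labels_ok and taints_ok:
            feasible.append(group["name"])
    return feasible
```

When this function returns an empty list for a pod, that is exactly the deadlock: the pod pends with "Insufficient nvidia.com/gpu" while the autoscaler correctly concludes that adding nodes from any group would not help. A single typo in a selector value is enough to hit it.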
Hidden Bottlenecks
Cause high latency despite low GPU utilization. A real production case saw p99 inference latency of 800ms with only 25% SM utilization. Investigation revealed the bottleneck was CPU preprocessing (tokenization and feature extraction taking 500ms) before GPU inference (taking 100ms). The autoscaler added GPU replicas uselessly because GPU capacity was not the constraint.
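The quickest way to catch this class of bug is per-stage wall-clock accounting inside the request path, since GPU-level metrics alone cannot attribute latency to preprocessing. A minimal sketch, where the stage names and the `time.sleep` calls are stand-ins for the real tokenization and inference work:

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per pipeline stage.
stage_seconds = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[stage] = stage_seconds.get(stage, 0.0) + elapsed

def handle_request():
    with timed("preprocess"):      # tokenization / feature extraction (CPU)
        time.sleep(0.05)           # stand-in for the real ~500ms CPU cost
    with timed("gpu_inference"):   # model forward pass
        time.sleep(0.01)           # stand-in for the real ~100ms GPU cost

handle_request()
bottleneck = max(stage_seconds, key=stage_seconds.get)
```

Exporting `stage_seconds` per stage (e.g. as histogram metrics) makes the 500ms-vs-100ms split visible at a glance, and makes it obvious that adding GPU replicas cannot shorten the dominant stage.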
Memory Bandwidth Saturation
Shows as SM utilization that looks normal while throughput is terrible, because the SMs are occupied but stalled waiting on memory. Relatedly, PCIe bottlenecks on transfers between host and GPU memory cause low GPU utilization while data copies dominate request time. The fix requires monitoring correlated metrics: SM occupancy AND memory bandwidth AND end-to-end latency AND CPU utilization, to identify the true constraint before scaling.
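The correlated-metrics triage can be sketched as a small rule table. The thresholds below are illustrative assumptions, not vendor guidance; the point is that no single metric is interpreted in isolation, and GPU scale-out is recommended only in the one case where compute is genuinely the constraint.

```python
def diagnose(sm_occupancy, mem_bandwidth_util, cpu_util,
             p99_latency_ms, slo_ms=200):
    """Rule-of-thumb triage across correlated metrics (all inputs are
    fractions in [0, 1] except latency). Thresholds are illustrative
    assumptions chosen for this sketch."""
    if p99_latency_ms <= slo_ms:
        return "healthy"
    if sm_occupancy < 0.4 and cpu_util > 0.8:
        return "cpu_bound"            # preprocessing, not GPU capacity
    if sm_occupancy < 0.4 and mem_bandwidth_util < 0.4:
        return "transfer_bound"       # host-device copies dominate
    if mem_bandwidth_util > 0.8:
        return "memory_bandwidth_bound"  # more replicas of the same GPU help little
    if sm_occupancy > 0.8:
        return "compute_bound"        # the one case where GPU scale-out helps
    return "inconclusive"
```

Applied to the production case above (25% SM occupancy, high CPU utilization, 800ms p99), the rules return `cpu_bound`, which is exactly the scaling decision the GPU-only autoscaler got wrong.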