GPU Autoscaling Failure Modes: Oscillation, Placement, and Hidden Bottlenecks

GPU autoscaling fails in subtle ways that cause production incidents even when the configuration looks correct. The failure modes stem from the complex interaction between GPU hardware characteristics, Kubernetes scheduling constraints, metric collection delays, and multi-layer coordination. Understanding these edge cases prevents outages and performance degradation.

Metric-induced oscillation occurs when instantaneous GPU utilization drives scaling decisions. A batch inference workload processes 100 requests together, causing GPU utilization to spike to 95% for 30 seconds and then drop to 5% while waiting for the next batch. Naive autoscaling sees the 95% spike and adds two replicas. By the time the new replicas start (after a 240-second cold start), the original batch has completed and utilization reads 5%, triggering an immediate scale-down. This thrashing wastes money and destabilizes the system. Production solutions use 3-to-5-minute rolling averages combined with asymmetric hysteresis: require 70% utilization sustained for 2 minutes before scaling up, but 30% sustained for 5 minutes before scaling down.

Placement deadlocks trap GPU pods in a pending state indefinitely despite autoscaling being "enabled." The cluster autoscaler only adds nodes that can satisfy pending pod constraints. If your GPU pod requires a node selector label or toleration that no node group provides, the autoscaler never adds capacity. Similarly, if GPU nodes lack proper taints, non-GPU workloads can consume them, leaving GPU pods unscheduled. The symptom is pending pods with "Insufficient nvidia.com/gpu" events while the cluster autoscaler logs show "no available node groups can satisfy this pod."

Hidden bottlenecks cause high latency despite low GPU utilization. A real production case saw p99 inference latency of 800ms with only 25% SM utilization. Investigation revealed the bottleneck was CPU preprocessing (tokenization and feature extraction taking 500ms) ahead of GPU inference (taking 100ms), so the autoscaler added GPU replicas uselessly because GPU capacity was not the constraint. Similarly, memory bandwidth saturation shows normal SM utilization but terrible throughput, and PCIe bottlenecks between CPU and GPU memory show low GPU utilization while data transfers dominate the request time. The fix requires monitoring correlated metrics (SM occupancy, memory bandwidth, end-to-end latency, and CPU utilization together) to identify the true constraint.
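The asymmetric hysteresis described above can be expressed as a small control loop. The sketch below is a minimal, hypothetical illustration in Python: the metric source behind sample_fn (for example a DCGM or Prometheus query) and the exact thresholds are assumptions taken from the numbers in this section, not a production implementation.

```python
import time
from collections import deque

class HysteresisScaler:
    """Scale-decision sketch: rolling-average GPU utilization plus
    asymmetric sustain windows (fast scale-up, slow scale-down)."""

    def __init__(self, sample_fn, window_s=300,
                 up_threshold=70.0, up_sustain_s=120,
                 down_threshold=30.0, down_sustain_s=300):
        self.sample_fn = sample_fn      # hypothetical: returns current GPU utilization in %
        self.window_s = window_s        # rolling-average window (5 minutes)
        self.up_threshold = up_threshold
        self.up_sustain_s = up_sustain_s
        self.down_threshold = down_threshold
        self.down_sustain_s = down_sustain_s
        self.samples = deque()          # (timestamp, utilization) pairs
        self.above_since = None
        self.below_since = None

    def _rolling_avg(self, now):
        # Drop samples older than the window, then average what remains.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return sum(u for _, u in self.samples) / len(self.samples)

    def decide(self):
        """Return 'scale_up', 'scale_down', or 'hold' for this tick."""
        now = time.monotonic()
        self.samples.append((now, self.sample_fn()))
        avg = self._rolling_avg(now)

        # Track how long the smoothed signal has stayed above/below the thresholds.
        self.above_since = (self.above_since or now) if avg >= self.up_threshold else None
        self.below_since = (self.below_since or now) if avg <= self.down_threshold else None

        # Scale up after 2 sustained minutes; scale down only after 5.
        if self.above_since and now - self.above_since >= self.up_sustain_s:
            return "scale_up"
        if self.below_since and now - self.below_since >= self.down_sustain_s:
            return "scale_down"
        return "hold"
```

The same idea maps onto HPA-style settings: feed the autoscaler a smoothed utilization signal instead of the instantaneous one, and give the scale-down side a longer stabilization window than the scale-up side.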
💡 Key Takeaways
Oscillation from instantaneous metrics causes thrashing: a batch workload spikes to 95% utilization for 30 seconds then drops to 5%, triggering a scale-up followed by an immediate scale-down after the 240-second cold start
Placement deadlocks occur when GPU pod constraints (such as a nodeSelector for A100s) do not match any cluster autoscaler node group configuration, leaving pods pending despite autoscaling being enabled (see the matching sketch after this list)
Hidden CPU bottlenecks cause 800ms p99 latency with only 25% GPU utilization because preprocessing takes 500ms before 100ms of GPU inference, making GPU scaling ineffective
Memory bandwidth saturation shows normal Streaming Multiprocessor (SM) occupancy but terrible throughput when large language model inference is memory-bound rather than compute-bound, requiring bandwidth-aware scaling signals
GPU memory fragmentation causes out-of-memory errors at allocation time despite apparent free capacity, because frequent model swaps or fractional allocations leave the free space non-contiguous
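The placement-deadlock takeaway comes down to a constraint-matching failure: the cluster autoscaler simulates the pending pod against each node group's template and only scales up if some template fits. The Python sketch below is a simplified, hypothetical model of that matching (real toleration and affinity semantics are richer); the pod and node-group values mirror the V100 example in the next section.

```python
def node_group_fits(pod, group):
    """Simplified check: does this node group's template satisfy the pod?"""
    # 1. Every nodeSelector label must be present with the same value.
    for key, val in pod["nodeSelector"].items():
        if group["labels"].get(key) != val:
            return False
    # 2. Every taint on the template must be tolerated by the pod.
    for taint in group["taints"]:
        if not any(tol.get("key") == taint["key"] for tol in pod["tolerations"]):
            return False
    # 3. The template must advertise enough of the extended GPU resource.
    return pod["gpu_request"] <= group["allocatable"].get("nvidia.com/gpu", 0)


pod = {
    "nodeSelector": {"gpu-type": "v100"},   # custom label set on the Deployment
    "tolerations": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
    "gpu_request": 1,
}

node_groups = [
    {   # GPU pool: carries the provider's accelerator label but not the custom gpu-type label
        "labels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"},
        "taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
        "allocatable": {"nvidia.com/gpu": 8},
    },
]

if not any(node_group_fits(pod, g) for g in node_groups):
    # No template matches, so the autoscaler never adds a node and the pod stays Pending.
    print("placement deadlock: no node group satisfies the pod's constraints")
```

The practical fix is to make the two sides agree: either add the custom label to the node group's template so new nodes carry it, or switch the pod's nodeSelector to the label the provider already applies.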
📌 Examples
Production incident: batch inference oscillated between 1 and 4 replicas every 6 minutes because of a 30-second batch processing pattern and an instantaneous utilization metric; fixed with a 5-minute rolling average and a 5-minute scale-down stabilization window
Placement deadlock: GPU pods stayed pending for 2 hours because the nodeSelector required the label gpu-type=v100, but the cluster autoscaler node groups only carried cloud.google.com/gke-accelerator=nvidia-tesla-v100 and never the custom gpu-type label
Performance debugging: p95 latency of 650ms at 30% GPU utilization turned out to be a PCIe bottleneck, with 500MB feature tensors taking 400ms per request to transfer from CPU to GPU memory; fixed by moving preprocessing onto the GPU (see the diagnostic sketch after this list)
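The debugging pattern behind these examples is to read the correlated metrics together before trusting GPU utilization on its own. The sketch below is a hypothetical classifier over an assumed metric dictionary (sm_util, mem_bw_util, cpu_util, and per-request preprocess/transfer/GPU latency shares); the field names and thresholds are illustrative, not from any specific monitoring stack.

```python
def classify_bottleneck(m):
    """Pick the most likely constraint from correlated metrics (illustrative thresholds)."""
    if m["cpu_util"] > 0.85 and m["preprocess_ms"] > m["gpu_ms"]:
        return "cpu_preprocessing"      # tokenization / feature extraction dominates
    if m["transfer_ms"] > m["gpu_ms"]:
        return "pcie_transfer"          # host-to-device copies dominate the request
    if m["mem_bw_util"] > 0.90 and m["sm_util"] < 0.50:
        return "gpu_memory_bandwidth"   # memory-bound inference, not compute-bound
    if m["sm_util"] > 0.80:
        return "gpu_compute"            # only this case is fixed by adding GPU replicas
    return "unknown"


# Roughly the 650 ms / 30% utilization case above: transfer time dominates.
print(classify_bottleneck({
    "cpu_util": 0.40, "sm_util": 0.30, "mem_bw_util": 0.35,
    "preprocess_ms": 100, "transfer_ms": 400, "gpu_ms": 150,
}))  # -> "pcie_transfer"
```

Only when the classification lands on GPU compute does adding replicas address the latency; the other cases call for faster preprocessing, moving feature pipelines onto the GPU, or bandwidth-aware scaling signals instead.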