GPU Metrics: Beyond Utilization for Accurate Autoscaling
Why GPU Utilization Misleads
GPU utilization is the most misleading single metric for autoscaling decisions. A GPU can report 90% SM utilization while delivering excellent throughput on a batch workload, so scaling up would only waste capacity; or it can show 20% utilization while request latency violates SLOs because the bottleneck lies elsewhere, so a utilization-based scaler would never trigger. Effective autoscaling requires combining multiple correlated signals to identify the true performance constraint.
Key GPU Hardware Metrics
SM utilization shows what percentage of Streaming Multiprocessors are active. GPU memory usage shows gigabytes consumed out of total device memory. Memory bandwidth utilization shows the percentage of theoretical peak bandwidth in use. A memory-bound workload like LLM inference will show low SM utilization but high memory bandwidth utilization and high latency. An I/O-bound workload shows low numbers on both while PCIe or network transfers dominate. Production systems export these via NVIDIA DCGM to a time-series store.
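The bottleneck classification above can be sketched as a small decision function. This is an illustrative sketch, not DCGM's API: it assumes the two signals have already been scraped and normalized to the 0-1 range, and the `busy`/`idle` thresholds are placeholder values you would tune for your hardware.

```python
def classify_bottleneck(sm_util, mem_bw_util, busy=0.6, idle=0.3):
    """Classify the dominant constraint from two normalized (0-1) GPU signals.

    sm_util      -- fraction of time SMs are active
    mem_bw_util  -- fraction of peak memory bandwidth in use
    Thresholds are illustrative; tune them per GPU generation and workload.
    """
    if sm_util >= busy and mem_bw_util < idle:
        return "compute-bound"        # SMs saturated, memory has headroom
    if mem_bw_util >= busy and sm_util < idle:
        return "memory-bound"         # typical LLM decode: SMs wait on DRAM
    if sm_util < idle and mem_bw_util < idle:
        return "io-or-host-bound"     # GPU mostly idle; check PCIe, network, queueing
    return "mixed"                    # no single dominant constraint
```

A scaler can then branch on the label: "memory-bound" suggests adding replicas or reducing batch size, while "io-or-host-bound" means more GPUs will not help.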
Application Level Signals
Queue depth shows how many requests are waiting for GPU processing. Request concurrency measures in-flight operations per replica. Latency percentiles (p95, p99) directly capture user experience. For LLMs, tokens per second and time to first token are better throughput indicators than raw GPU utilization because they account for batching efficiency and memory access patterns.
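The LLM-specific signals can be derived from per-request timestamps. A minimal sketch, assuming a hypothetical `Request` record with arrival, first-token, and completion times; real serving stacks expose equivalent fields under their own names.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float      # wall-clock seconds when the request was accepted
    first_token: float  # timestamp when the first output token was emitted
    finished: float     # timestamp when the last output token was emitted
    tokens: int         # number of output tokens generated

def throughput_signals(reqs):
    """Return (tokens/sec, mean time-to-first-token) over a window of requests."""
    if not reqs:
        return 0.0, 0.0
    window = max(r.finished for r in reqs) - min(r.arrival for r in reqs)
    tps = sum(r.tokens for r in reqs) / window if window > 0 else 0.0
    ttft = sum(r.first_token - r.arrival for r in reqs) / len(reqs)
    return tps, ttft
```

Unlike SM utilization, these two numbers move in opposite directions when batching degrades: tokens/sec may stay flat while time to first token climbs, which is exactly the signal an autoscaler needs.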
Smoothed Windows and Hysteresis
A robust approach uses 3-minute rolling averages to prevent scaling decisions driven by momentary spikes when a large batch arrives or completes. Combine this with hysteresis: scale up quickly when metrics exceed thresholds (for example, 70% utilization sustained for 2 minutes) but scale down conservatively (for example, dropping below 30% for 5 minutes). This asymmetry prevents oscillation while staying responsive to genuine load increases.
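Putting the smoothing and hysteresis together, here is a minimal sketch. The class name and API are invented for illustration; the defaults mirror the numbers above (3-minute window, 70% sustained for 2 minutes to scale up, below 30% for 5 minutes to scale down), and the signal is assumed normalized to 0-1.

```python
from collections import deque

class HysteresisScaler:
    """Scaling decisions from a rolling-average signal with asymmetric dwell times."""

    def __init__(self, window_s=180, up_thresh=0.70, up_dwell_s=120,
                 down_thresh=0.30, down_dwell_s=300):
        self.samples = deque()  # (timestamp, value) pairs within the window
        self.window_s = window_s
        self.up_thresh, self.up_dwell_s = up_thresh, up_dwell_s
        self.down_thresh, self.down_dwell_s = down_thresh, down_dwell_s
        self._above_since = None  # when the average first crossed up_thresh
        self._below_since = None  # when the average first dropped under down_thresh

    def observe(self, now, value):
        """Record one sample; return "scale_up", "scale_down", or "hold"."""
        self.samples.append((now, value))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop samples older than the rolling window
        avg = sum(v for _, v in self.samples) / len(self.samples)

        # Track how long the smoothed average has stayed past each threshold.
        if avg > self.up_thresh:
            if self._above_since is None:
                self._above_since = now
        else:
            self._above_since = None
        if avg < self.down_thresh:
            if self._below_since is None:
                self._below_since = now
        else:
            self._below_since = None

        # Asymmetric dwell: fast to add capacity, slow to remove it.
        if self._above_since is not None and now - self._above_since >= self.up_dwell_s:
            self._above_since = None
            return "scale_up"
        if self._below_since is not None and now - self._below_since >= self.down_dwell_s:
            self._below_since = None
            return "scale_down"
        return "hold"
```

Feeding a sustained 90% signal produces "hold" until the 2-minute dwell elapses, then "scale_up"; a sustained 10% signal takes the full 5 minutes to produce "scale_down", which is the anti-oscillation asymmetry in action.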