
GPU Metrics: Beyond Utilization for Accurate Autoscaling

Why GPU Utilization Misleads

GPU utilization is the most misleading metric in autoscaling decisions. A GPU can report 90% SM utilization while delivering excellent throughput for batch workloads, or show 20% utilization while request latency violates SLOs because the bottleneck is elsewhere. Effective autoscaling requires combining multiple correlated signals to understand the true performance constraint.

Key GPU Hardware Metrics

SM occupancy shows what percentage of Streaming Multiprocessors are active. GPU memory usage shows gigabytes consumed out of total device memory. Memory bandwidth utilization shows the percentage of theoretical peak bandwidth in use. A memory-bound workload like LLM inference shows low SM utilization but high memory bandwidth utilization and high latency. An I/O-bound workload shows low utilization on both while PCIe or network transfers dominate. Production systems export these metrics via NVIDIA DCGM to a time-series store.
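The correlation logic can be sketched as a small classifier over these signals. This is an illustrative sketch, not a production rule set: the function name, thresholds, and SLO default are all assumptions chosen to match the patterns described above.

```python
# Hypothetical sketch: classify the true bottleneck by correlating metrics,
# rather than reading SM utilization alone. Thresholds are illustrative.

def classify_bottleneck(sm_util: float, mem_bw_util: float,
                        p95_latency_ms: float, slo_ms: float = 300.0) -> str:
    """Utilization inputs are fractions in [0, 1]."""
    if p95_latency_ms <= slo_ms:
        return "healthy"                 # latency fine regardless of utilization
    if mem_bw_util > 0.8:
        return "memory-bandwidth-bound"  # low SM util + high bandwidth: LLM pattern
    if sm_util > 0.8:
        return "compute-bound"
    return "io-bound"                    # both low while latency suffers: PCIe/network

# A memory-bound LLM inference phase: 25% SM util, 85% bandwidth, 1200 ms p95
print(classify_bottleneck(0.25, 0.85, 1200.0))  # -> memory-bandwidth-bound
```

The key design point is that latency gates the decision first: high utilization with met SLOs is healthy batch throughput, not a scaling trigger.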

Application Level Signals

Queue depth shows how many requests are waiting for GPU processing. Request concurrency measures in-flight operations per replica. Latency percentiles (p95, p99) directly capture user experience. For LLMs, tokens per second and time to first token are better throughput indicators than raw GPU utilization because they account for batching efficiency and memory access patterns.
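Deriving these LLM-level signals from per-request traces is straightforward. A minimal sketch, assuming hypothetical trace fields (the dataclass and function names are not from any specific serving framework):

```python
# Illustrative sketch: compute tokens/sec and time-to-first-token (TTFT)
# from per-request timing records. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival_s: float      # when the request arrived
    first_token_s: float  # when the first output token was emitted
    done_s: float         # when the last token was emitted
    tokens_out: int       # number of generated tokens

def ttft_ms(trace: RequestTrace) -> float:
    """Time to first token, the user-perceived responsiveness signal."""
    return (trace.first_token_s - trace.arrival_s) * 1000.0

def tokens_per_second(traces: list[RequestTrace], window_s: float) -> float:
    """Aggregate generation throughput over a measurement window."""
    return sum(t.tokens_out for t in traces) / window_s

traces = [RequestTrace(0.0, 0.18, 1.4, 96), RequestTrace(0.2, 0.45, 1.9, 84)]
print(tokens_per_second(traces, window_s=1.0))  # -> 180.0 tokens/sec
print(ttft_ms(traces[0]))                       # -> 180.0 ms
```

Exported per replica, these become the custom metrics an autoscaler targets instead of device utilization.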

Smoothed Windows and Hysteresis

The robust approach uses 3-minute rolling averages to prevent scaling decisions based on momentary spikes when a large batch arrives or completes. Combine this with hysteresis: scale up quickly when metrics exceed thresholds (e.g., 70% utilization sustained for 2 minutes) but scale down conservatively (e.g., dropping below 30% for 5 minutes). This asymmetry prevents oscillation while staying responsive to genuine load increases.
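The smoothing-plus-hysteresis policy above can be sketched in a few lines. The class and method names are assumptions; the windows and thresholds mirror the values in the text (3-minute average, 70% for 2 minutes up, 30% for 5 minutes down).

```python
# Minimal sketch of smoothed, asymmetric scaling decisions. Not production code.
from collections import deque

class HysteresisScaler:
    def __init__(self, interval_s=15, avg_window_s=180,
                 up_thresh=0.70, up_hold_s=120,
                 down_thresh=0.30, down_hold_s=300):
        self.samples = deque(maxlen=avg_window_s // interval_s)  # 3-min rolling window
        self.up_thresh, self.up_hold = up_thresh, up_hold_s // interval_s
        self.down_thresh, self.down_hold = down_thresh, down_hold_s // interval_s
        self.above = self.below = 0  # consecutive intervals beyond each threshold

    def observe(self, metric: float) -> str:
        """Feed one sample per interval; returns 'scale_up', 'scale_down', or 'hold'."""
        self.samples.append(metric)
        avg = sum(self.samples) / len(self.samples)  # smoothed signal, not raw spike
        self.above = self.above + 1 if avg > self.up_thresh else 0
        self.below = self.below + 1 if avg < self.down_thresh else 0
        if self.above >= self.up_hold:      # fast path: sustained 2 min above 70%
            self.above = 0
            return "scale_up"
        if self.below >= self.down_hold:    # slow path: sustained 5 min below 30%
            self.below = 0
            return "scale_down"
        return "hold"
```

Because the counters reset whenever the smoothed average crosses back over a threshold, a single batch-completion dip never triggers a scale-down; only sustained low load does.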

💡 Key Takeaways
SM utilization alone misleads: 90% can accompany healthy batch throughput, while 20% can coexist with terrible latency when memory bandwidth (85% utilized) is the actual bottleneck
Memory-bound LLM inference shows low SM occupancy but high memory bandwidth usage and high latency, requiring multi-metric correlation to identify the constraint
Application signals like queue depth (45 requests waiting), p95 latency (320 ms), and tokens per second (180) capture user experience more directly than hardware utilization
Smoothing with 3-minute rolling averages prevents oscillation from batch workload spikes that cause momentary 100% utilization followed by immediate drops to 10%
Asymmetric hysteresis scales up quickly (above 70% for 2 minutes) but scales down conservatively (below 30% for 5 minutes) to avoid thrashing
📌 Interview Tips
1. NVIDIA DCGM exports per-GPU metrics (SM occupancy, memory usage, memory bandwidth, power draw, temperature) to Prometheus; a custom metrics adapter then exposes per-pod aggregates for the Horizontal Pod Autoscaler (HPA)
2. LLM serving tracks tokens per second per replica: a typical scaling target is 200 tokens/sec with 8-bit quantization and dynamic batching windows of 50 ms to balance throughput and latency
3. Observed failure: an autoscaler using only GPU utilization scaled down during a memory-bound phase showing 25% SM utilization, causing p99 latency to spike from 200 ms to 1200 ms