
GPU Metrics: Beyond Utilization for Accurate Autoscaling

GPU utilization is the most misleading metric in autoscaling decisions. A GPU can report 90% SM utilization while delivering excellent throughput for batch workloads, or show 20% utilization while request latency violates Service Level Objectives (SLOs) because the bottleneck lies elsewhere. Effective autoscaling requires combining multiple correlated signals to identify the true performance constraint.

The key GPU hardware metrics are SM occupancy (what fraction of Streaming Multiprocessor capacity is active), GPU memory usage (gigabytes consumed out of total device memory), and memory bandwidth utilization (percentage of theoretical peak bandwidth in use). A memory-bound workload such as large language model inference shows low SM utilization but high memory bandwidth usage and high latency. An Input/Output (I/O) bound workload shows low values for both while PCIe or network transfers dominate. Production systems export these metrics via NVIDIA DCGM to a time-series store, then compute derived metrics on top of them.

Application-level signals are equally important. Queue depth shows how many requests are waiting for GPU processing. Request concurrency measures in-flight operations per replica. Latency percentiles (p95, p99) directly capture user experience. For Large Language Models (LLMs), tokens per second and time to first token are better throughput indicators than raw GPU utilization because they account for batching efficiency and memory access patterns.

The robust approach uses smoothed windows to avoid noise. A 3-minute rolling average prevents scaling decisions based on momentary spikes when a large batch arrives or completes. Combine this with hysteresis: scale up quickly when metrics exceed thresholds (for example, 70% utilization sustained for 2 minutes) but scale down conservatively (for example, dropping below 30% for 5 minutes). This asymmetry prevents oscillation while staying responsive to genuine load increases.
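As a concrete illustration, here is a minimal Python sketch of that smoothing-plus-hysteresis policy, assuming a hypothetical scaler class that receives one utilization sample per scrape interval; the window length and thresholds mirror the numbers above and would be tuned per workload.

```python
from collections import deque
import time


class HysteresisScaler:
    """Sketch of the smoothing-plus-hysteresis policy described above.

    A ~3-minute rolling average filters momentary spikes; scale-up fires after
    the smoothed value stays above 70% for 2 minutes, scale-down only after it
    stays below 30% for 5 minutes. All names and thresholds are illustrative.
    """

    def __init__(self, sample_period_s=15):
        self.samples = deque(maxlen=180 // sample_period_s)  # ~3-minute window
        self.above_since = None  # when the smoothed value first exceeded 70%
        self.below_since = None  # when it first dropped below 30%

    def observe(self, utilization_pct, now=None):
        """Feed one metric sample; return 'scale_up', 'scale_down', or 'hold'."""
        now = time.monotonic() if now is None else now
        self.samples.append(utilization_pct)
        smoothed = sum(self.samples) / len(self.samples)

        if smoothed > 70.0:
            self.below_since = None
            if self.above_since is None:
                self.above_since = now
            if now - self.above_since >= 120:   # sustained 2 minutes: react quickly
                return "scale_up"
        elif smoothed < 30.0:
            self.above_since = None
            if self.below_since is None:
                self.below_since = now
            if now - self.below_since >= 300:   # sustained 5 minutes: shrink cautiously
                return "scale_down"
        else:
            self.above_since = None
            self.below_since = None
        return "hold"
```

The asymmetry lives entirely in the two dwell times: a brief burst never reaches the 2-minute mark, and a brief lull never reaches the 5-minute mark, so the replica count only moves on sustained changes.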
💡 Key Takeaways
SM utilization alone misleads: 90% can mean healthy batch throughput, while 20% can coexist with SLO-violating latency when memory bandwidth (85% utilized) is the actual bottleneck
Memory-bound LLM inference shows low SM occupancy but high memory bandwidth usage and high latency, requiring multi-metric correlation to identify the constraint (sketched after this list)
Application signals such as queue depth (45 requests waiting), p95 latency (320ms), and tokens per second (180) capture user experience more directly than hardware utilization
Smoothing with 3-minute rolling averages prevents oscillation from batch-workload spikes that cause momentary 100% utilization followed by immediate drops to 10%
Asymmetric hysteresis scales up quickly (above 70% for 2 minutes) but scales down conservatively (below 30% for 5 minutes) to avoid thrashing
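A minimal sketch of that multi-metric correlation, assuming a hypothetical ReplicaSignals record and illustrative thresholds; in a real deployment the hardware values would come from DCGM exports and the application values from the serving layer, with thresholds tuned per workload.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSignals:
    # Illustrative per-replica signals: hardware values from DCGM exports,
    # application values from the serving layer.
    sm_utilization_pct: float
    mem_bandwidth_pct: float
    queue_depth: int
    p95_latency_ms: float
    slo_latency_ms: float


def classify_bottleneck(s: ReplicaSignals) -> str:
    """Correlate hardware and application signals instead of trusting SM% alone."""
    latency_violated = s.p95_latency_ms > s.slo_latency_ms
    if latency_violated and s.mem_bandwidth_pct > 80:
        return "memory_bound"   # low SM%, high bandwidth, high latency: the LLM case
    if latency_violated and s.queue_depth > 20:
        return "queueing"       # GPU looks idle while requests stack up in front of it
    if s.sm_utilization_pct > 80:
        return "compute_bound"
    return "underutilized"


def safe_to_scale_down(s: ReplicaSignals) -> bool:
    # Only a genuinely underutilized replica may shrink; a memory-bound phase
    # with low SM utilization and SLO-violating latency is blocked.
    return classify_bottleneck(s) == "underutilized"


# Memory-bound phase: 25% SM, 88% bandwidth, p95 at 1200ms against a 250ms SLO.
print(safe_to_scale_down(ReplicaSignals(25.0, 88.0, 5, 1200.0, 250.0)))  # False
```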
📌 Examples
NVIDIA DCGM exports per-GPU metrics (SM occupancy, memory usage, memory bandwidth, power draw, temperature) to Prometheus, then a custom metrics adapter exposes per-pod aggregates for the Horizontal Pod Autoscaler (HPA)
LLM serving tracks tokens per second per replica: the scaling target is 200 tokens/sec with 8-bit quantization and dynamic batching windows of 50ms to balance throughput and latency (see the replica-count sketch after this list)
Observed failure: an autoscaler using only GPU utilization scaled down during a memory-bound phase showing 25% SM utilization, causing p99 latency to spike from 200ms to 1200ms
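For the tokens-per-second target above, here is a sketch of the proportional calculation an HPA-style controller applies to an average-value metric, desired = ceil(current × observed ÷ target); the function is illustrative rather than any specific controller's API, and the 10% tolerance band mirrors HPA's default.

```python
import math


def desired_replicas(current_replicas: int,
                     avg_tokens_per_sec: float,
                     target_tokens_per_sec: float = 200.0,
                     tolerance: float = 0.1) -> int:
    """HPA-style proportional scaling on an average-value metric:
    desired = ceil(current * observed / target), with a tolerance band so
    small deviations from the target do not cause churn."""
    ratio = avg_tokens_per_sec / target_tokens_per_sec
    if abs(ratio - 1.0) <= tolerance:   # close enough to the target: hold steady
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))


# 4 replicas each averaging 290 tokens/sec against a 200 tokens/sec target
print(desired_replicas(4, 290.0))  # ceil(4 * 1.45) = 6
```

Feeding this calculation with the smoothed, multi-metric signals described earlier (rather than raw SM utilization) is what prevents the scale-down failure in the last example.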