How to Detect GPU Underutilization in AI Inference Workloads

GPU utilization percentage is the most-watched metric in AI infrastructure — and the most misleading. Here's what to measure instead, and how to instrument your fleet to catch waste before it compounds.
Why GPU Utilization % Lies to You
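The number you see in nvidia-smi (say, 87% GPU utilization) is the percentage of the sampling window during which at least one kernel was running on the GPU. It tells you the GPU was doing something, not that it was doing useful work at full capacity: a small kernel that occupies a handful of SMs counts the same as one that saturates the chip. A GPU can report high utilization while: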
- Waiting on CPU preprocessing to finish
- Sitting idle between inference requests in a serverless setup
- Running at a fraction of its memory bandwidth capacity
- Processing a batch size so small it barely exercises the hardware
True underutilization hides behind a healthy-looking number.
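To make the gap concrete, the sketch below scores the same one-second window two ways: the time-busy figure that nvidia-smi-style utilization represents, and a figure weighted by how many SMs were actually occupied. The `KernelSpan` type, its `sms_busy` field, and the example numbers are hypothetical, and the 132-SM count assumes an H100 SXM part.

```python
from dataclasses import dataclass

@dataclass
class KernelSpan:
    start_ms: float
    end_ms: float
    sms_busy: int  # hypothetical: number of SMs this kernel kept occupied

def time_busy_pct(spans, window_ms):
    """Share of the window during which any kernel was running."""
    busy, cursor = 0.0, 0.0
    for s in sorted(spans, key=lambda s: s.start_ms):
        start = max(s.start_ms, cursor)      # skip time already counted
        end = min(s.end_ms, window_ms)       # clip to the window
        if end > start:
            busy += end - start
            cursor = end
    return 100.0 * busy / window_ms

def sm_weighted_pct(spans, window_ms, total_sms):
    """Share of SM-time actually occupied (assumes spans do not overlap)."""
    work = sum(
        (min(s.end_ms, window_ms) - max(s.start_ms, 0.0)) * s.sms_busy
        for s in spans
    )
    return 100.0 * work / (window_ms * total_sms)

# One kernel runs for 870 ms of a 1000 ms window but occupies only 13 SMs.
spans = [KernelSpan(0.0, 870.0, 13)]
reported = time_busy_pct(spans, 1000.0)         # 87% "utilization"
occupied = sm_weighted_pct(spans, 1000.0, 132)  # under 9% of SM-time
print(f"time-busy: {reported:.1f}%  SM-weighted: {occupied:.1f}%")
```

The same window reads as 87% busy by the first measure and under 9% by the second, which is the kind of discrepancy the list above describes.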
---
The Four Metrics That Actually Matter
1. SM (Streaming Multiprocessor) Utilization
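This is the compute utilization of the GPU cores themselves. It is available via NVIDIA DCGM as the profiling field DCGM_FI_PROF_SM_ACTIVE, the fraction of cycles in which an SM had at least one warp assigned, averaged across all SMs. It is reported as a ratio from 0 to 1 and may need to be enabled in the exporter's counter list. Note that DCGM_FI_DEV_GPU_UTIL is not this metric: it is the same coarse time-busy figure that nvidia-smi reports. Anything consistently below 60% on an inference workload is a signal worth investigating.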
2. Memory Bandwidth Utilization
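LLM inference is memory-bandwidth-bound for much of its runtime, especially during decode. DCGM_FI_PROF_DRAM_ACTIVE reports the fraction of cycles the device memory interface was busy; DCGM_FI_DEV_MEM_COPY_UTIL is a coarser alternative based on memory-controller busy time. If you're using less than 70% of available memory bandwidth, you may be leaving throughput on the table. Check this alongside SM utilization: high SM activity with low bandwidth use points to a compute-bound phase such as prefill, while both being low usually means the GPU is starved by something upstream, often the CPU.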
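When bandwidth counters are not available, a back-of-the-envelope estimate can still indicate how far a decode workload sits from the bandwidth ceiling. The sketch below assumes each decode step reads every model weight once and ignores KV cache traffic, so it is a lower bound on real memory traffic. The function name, the 14 GB model size, and the 3.35 TB/s peak (an H100 SXM figure) are illustrative assumptions.

```python
def decode_bandwidth_fraction(weight_bytes, tokens_per_sec, batch_size,
                              peak_bytes_per_sec):
    """Estimate the fraction of peak memory bandwidth used during decode.

    Assumes one full pass over the weights per decode step, shared by
    every sequence in the batch. KV cache reads are not counted.
    """
    steps_per_sec = tokens_per_sec / batch_size
    achieved_bytes_per_sec = weight_bytes * steps_per_sec
    return achieved_bytes_per_sec / peak_bytes_per_sec

# Hypothetical: 7B parameters in fp16 (14e9 bytes), 100 tokens/s at batch 1.
fraction = decode_bandwidth_fraction(
    weight_bytes=14e9,
    tokens_per_sec=100.0,
    batch_size=1,
    peak_bytes_per_sec=3.35e12,
)
print(f"estimated bandwidth use: {fraction:.0%}")
```

Under these assumptions the workload uses roughly 42% of peak bandwidth, which would fall below the 70% guideline above.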
3. GPU Memory Occupancy
High SM utilization with low memory occupancy often means your batch sizes are too small. The GPU is active but not saturated — you're paying for H100 capacity and getting A10G throughput.
4. Request Queue Depth + Inter-Request Idle Time
For inference specifically, the gap between requests is where utilization bleeds out. If your GPU is idle for 200ms between 50ms inference calls, it is busy for 50ms out of every 250ms, so your effective utilization is 20% regardless of what nvidia-smi shows.
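The idle-gap arithmetic can be computed directly from request timestamps. This is a minimal sketch that assumes requests are served one at a time and are given as `(start_ms, end_ms)` pairs; the function names and sample timings are hypothetical.

```python
def idle_gaps(requests):
    """Idle time in ms between consecutive requests, sorted by start."""
    ordered = sorted(requests)
    return [
        max(0.0, nxt_start - prev_end)
        for (_, prev_end), (nxt_start, _) in zip(ordered, ordered[1:])
    ]

def effective_utilization(requests, window_ms):
    """Fraction of the observation window spent serving requests."""
    busy = sum(end - start for start, end in requests)
    return busy / window_ms

# Four 50 ms calls, each followed by 200 ms of idle, over a 1000 ms window.
requests = [(0.0, 50.0), (250.0, 300.0), (500.0, 550.0), (750.0, 800.0)]
print(idle_gaps(requests))
print(effective_utilization(requests, window_ms=1000.0))
```

Tracking the gap distribution alongside the ratio helps distinguish steady low traffic from bursty traffic with long quiet periods.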
---
How to Instrument for Detection
Step 1 — Enable DCGM
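NVIDIA's Data Center GPU Manager collects the metrics above, and dcgm-exporter publishes them in Prometheus format. If you're on Kubernetes, deploy dcgm-exporter as a DaemonSet. This gives you per-GPU, per-pod telemetry. The exporter's default collection interval is 30 seconds, so lower it if you need resolution closer to 1 second.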
```shell
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```
Step 2 — Define Utilization Thresholds
Set alerts, not just dashboards. Suggested thresholds for inference workloads:
| Metric | Warning | Critical |
|---|---|---|
| SM Utilization (5-min avg) | < 50% | < 30% |
| Memory Bandwidth | < 60% | < 40% |
| Inter-request idle | > 150ms | > 300ms |
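The thresholds in the table can be encoded as a small lookup that an alerting job evaluates per GPU. This is a sketch, not a replacement for Prometheus alert rules; the metric keys and the `classify` function are hypothetical names, and values are assumed to be percentages or milliseconds as in the table.

```python
# (direction, warning, critical) per metric, copied from the table above.
THRESHOLDS = {
    "sm_util_pct": ("below", 50.0, 30.0),             # 5-minute average
    "mem_bw_pct": ("below", 60.0, 40.0),
    "inter_request_idle_ms": ("above", 150.0, 300.0),
}

def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for one metric reading."""
    direction, warning, critical = THRESHOLDS[metric]
    if direction == "below":
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    else:
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    return "ok"

reading = {"sm_util_pct": 42.0, "mem_bw_pct": 35.0, "inter_request_idle_ms": 120.0}
for name, value in reading.items():
    print(name, classify(name, value))
```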
Step 3 — Correlate with CPU Metrics
GPU underutilization is usually caused by something upstream. Add CPU utilization, tokenization latency, and data pipeline throughput to the same dashboard. If CPU is pegged at 100% while your GPU sits at 40%, you've likely found your bottleneck.
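One simple way to quantify that correlation is to count how often a saturated CPU coincides with an underused GPU. The sketch below assumes CPU and GPU utilization were sampled at the same timestamps; the function name and the 95% and 50% cutoffs are illustrative choices, not recommendations from any tool.

```python
def cpu_bound_fraction(samples, cpu_high=95.0, gpu_low=50.0):
    """Fraction of samples where the CPU is saturated while the GPU is not.

    samples: iterable of (cpu_pct, gpu_pct) pairs taken at the same time.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for cpu, gpu in samples if cpu >= cpu_high and gpu <= gpu_low)
    return hits / len(samples)

# Hypothetical readings: three of four samples show the CPU-bound pattern.
samples = [(100.0, 40.0), (98.0, 42.0), (60.0, 85.0), (99.0, 38.0)]
print(cpu_bound_fraction(samples))
```

A high fraction suggests looking at tokenization and preprocessing before touching the GPU configuration.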
Step 4 — Profile at the Model Level
Use NVTX markers or vLLM's built-in profiling to identify which phases of inference are causing idle time: prefill, decode, KV cache eviction, or scheduling overhead.
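A phase-level marker can be as small as a context manager wrapped around each stage. The sketch below emits an NVTX range when the optional `nvtx` Python package is installed and always records host-side wall-clock time as a fallback. The `phase` helper and the sleep calls standing in for real model work are hypothetical, and because GPU work is asynchronous, the wall-clock totals only reflect GPU time if the wrapped code synchronizes.

```python
import time
from contextlib import contextmanager

try:
    import nvtx  # optional; ranges show up in tools such as Nsight Systems
except ImportError:
    nvtx = None

phase_totals = {}  # phase name -> accumulated host-side seconds

@contextmanager
def phase(name):
    """Mark one inference phase with an NVTX range and a wall-clock timer."""
    rng = nvtx.start_range(message=name) if nvtx else None
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        phase_totals[name] = phase_totals.get(name, 0.0) + elapsed
        if nvtx:
            nvtx.end_range(rng)

with phase("prefill"):
    time.sleep(0.01)  # stand-in for the real prefill call
with phase("decode"):
    time.sleep(0.02)  # stand-in for the real decode loop

for name, seconds in phase_totals.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```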
---
Common Causes and What to Do
| Root Cause | Signal | Fix |
|---|---|---|
| CPU-bound preprocessing | GPU idle, CPU high | Move tokenization to GPU or parallelize |
| Batch size too small | Low memory occupancy | Increase max batch size or use continuous batching |
| Serverless cold start | Idle spikes between requests | Pre-warm workers, tune scale-to-zero thresholds |
| Wrong GPU tier | Low SM util on high-memory GPU | Right-size to a smaller tier |
| KV cache pressure | High memory, low compute | Reduce context length or add KV cache offloading |
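The signal column of the table can be turned into a first-pass triage rule set. This is a rough sketch: the field names, the `likely_causes` function, and the 50% and 90% cutoffs are assumptions chosen for illustration, and several causes can match at once.

```python
def likely_causes(sample, low=50.0, high=90.0):
    """Map one set of readings to candidate root causes from the table."""
    sm = sample["sm_util_pct"]
    cpu = sample["cpu_util_pct"]
    mem = sample["mem_occupancy_pct"]
    causes = []
    if sm < low and cpu > high:
        causes.append("CPU-bound preprocessing")
    if mem < low:
        causes.append("Batch size too small")
    if sample.get("idle_spikes", False):
        causes.append("Serverless cold start")
    if sm < low and mem < low:
        causes.append("Wrong GPU tier")
    if sm < low and mem > high:
        causes.append("KV cache pressure")
    return causes

print(likely_causes(
    {"sm_util_pct": 35.0, "cpu_util_pct": 30.0, "mem_occupancy_pct": 95.0}
))
```

Treat the output as a list of places to look first, then confirm with profiling before applying a fix.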
---
The Fleet-Level View
Detecting underutilization on a single GPU is one thing. At fleet scale (multiple clusters, mixed providers, dozens of models) the problem compounds. A model underutilizing a 4xH100 node by 40% is burning roughly $12K/month in idle capacity, assuming on-demand pricing of about $10 per GPU-hour. Multiply that across a fleet and it becomes the largest line item nobody is tracking.
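The monthly figure follows from simple arithmetic. The sketch below assumes an on-demand price of $10 per GPU-hour and 730 hours in a month; both are illustrative inputs, so substitute your own contract rates.

```python
def idle_cost_per_month(gpus, hourly_rate_usd, idle_fraction,
                        hours_per_month=730):
    """Monthly spend attributable to idle GPU capacity."""
    return gpus * hourly_rate_usd * hours_per_month * idle_fraction

# 4 GPUs at an assumed $10/GPU-hour, 40% of capacity unused.
cost = idle_cost_per_month(gpus=4, hourly_rate_usd=10.0, idle_fraction=0.4)
print(f"${cost:,.0f} per month")
```

With these inputs the result is a little under $12K per month for a single node.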
Fleet-level detection requires aggregating per-GPU telemetry into a control plane that surfaces waste by workload, cluster, and tier — not just individual node dashboards.
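As a rough illustration of that aggregation, the sketch below rolls per-GPU samples up into wasted GPU-hours grouped by any dimension. The `GpuSample` record, its fields, and the sample data are hypothetical; a real control plane would read these from the telemetry store.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GpuSample:
    cluster: str
    workload: str
    tier: str
    gpu_hours: float
    avg_util: float  # 0.0 to 1.0 over the reporting period

def waste_by(samples, key):
    """Wasted GPU-hours grouped by 'cluster', 'workload', or 'tier'."""
    totals = defaultdict(float)
    for s in samples:
        totals[getattr(s, key)] += s.gpu_hours * (1.0 - s.avg_util)
    # Largest waste first so the worst offenders surface at the top.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

fleet = [
    GpuSample("us-east", "chat-llm", "H100", 720.0, 0.5),
    GpuSample("us-east", "embedder", "A10G", 720.0, 0.75),
    GpuSample("eu-west", "chat-llm", "H100", 720.0, 0.25),
]
print(waste_by(fleet, "workload"))
print(waste_by(fleet, "tier"))
```

Grouping the same records by workload, cluster, and tier gives three views of the same waste, which is what node-level dashboards cannot show.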
See how Paralleliq surfaces underutilization across your inference fleet →
---
Next in the GPU Ops Field Guide: [OOM Root Cause for Inference Workloads →](/blog/gpu-ops-oom-root-cause)