ParallelIQ
GPU Ops Field Guide

How to Detect GPU Underutilization in AI Inference Workloads

By Sam Hosseini · May 16, 2026 · 7 min read

GPU utilization percentage is the most-watched metric in AI infrastructure — and the most misleading. Here's what to measure instead, and how to instrument your fleet to catch waste before it compounds.

Why GPU Utilization % Lies to You

The number you see in nvidia-smi (say, 87% GPU utilization) reports the share of the sampling window in which at least one kernel was running on the GPU, not whether the hardware was doing useful work at full capacity. A GPU can report high utilization while:

  • Waiting on CPU preprocessing to finish
  • Sitting idle between inference requests in a serverless setup
  • Running at a fraction of its memory bandwidth capacity
  • Processing a batch size so small it barely exercises the hardware

True underutilization hides behind a healthy-looking number.

---

The Four Metrics That Actually Matter

1. SM (Streaming Multiprocessor) Utilization

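This is the compute activity of the GPU cores themselves. NVIDIA DCGM exposes it as the profiling field DCGM_FI_PROF_SM_ACTIVE, a ratio from 0 to 1 that may need to be enabled in your exporter's metrics list. Note that DCGM_FI_DEV_GPU_UTIL is the same coarse figure nvidia-smi reports, so it is not a substitute. Anything consistently below 60% on an inference workload is a signal worth investigating.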

2. Memory Bandwidth Utilization

LLM inference, especially the decode phase, is usually memory-bandwidth-bound. If you're using less than 70% of available memory bandwidth, you're leaving throughput on the table. DCGM_FI_PROF_DRAM_ACTIVE tracks this most directly; DCGM_FI_DEV_MEM_COPY_UTIL is a coarser proxy. Check this alongside SM utilization: a gap between the two usually means the CPU is the bottleneck.

3. GPU Memory Occupancy

High SM utilization with low memory occupancy often means your batch sizes are too small. The GPU is active but not saturated — you're paying for H100 capacity and getting A10G throughput.

4. Request Queue Depth + Inter-Request Idle Time

For inference specifically, the gap between requests is where utilization bleeds out. If your GPU is idle for 200ms between 50ms inference calls, your effective utilization is only 20% (50ms busy out of every 250ms), regardless of what nvidia-smi shows.
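The duty-cycle arithmetic behind that example can be sketched in a few lines. The function name is illustrative, not part of any NVIDIA tooling; it simply divides busy time by total wall-clock time.

```python
def effective_utilization(busy_ms: float, idle_ms: float) -> float:
    """Fraction of wall-clock time the GPU spends serving inference calls."""
    total_ms = busy_ms + idle_ms
    if total_ms <= 0:
        raise ValueError("busy_ms + idle_ms must be positive")
    return busy_ms / total_ms


# 50ms of inference followed by 200ms of idle time, as in the example above.
print(f"{effective_utilization(50, 200):.0%}")  # 50 / (50 + 200) -> 20%
```

Feeding it measured per-request latency and the observed gap between requests gives a number you can compare directly against the utilization percentage on your dashboard.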

---

How to Instrument for Detection

Step 1 — Enable DCGM

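NVIDIA's Data Center GPU Manager (DCGM) collects the metrics above, and dcgm-exporter publishes them in Prometheus format. If you're on Kubernetes, deploy dcgm-exporter as a DaemonSet. This gives you per-GPU, per-pod telemetry. The collection interval is configurable, so confirm it is set low enough if you want 1-second resolution.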

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
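Once the exporter is running, its metrics endpoint returns plain Prometheus text. As a rough sketch of what consuming that output looks like, the snippet below parses a small sample into per-GPU values. The sample text, pod names, and values are invented for illustration, and the parser is deliberately simplified: it handles labelled lines without timestamps only.

```python
import re
from collections import defaultdict

# Illustrative sample only: label values and readings are made up.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="llm-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="llm-b"} 12
DCGM_FI_DEV_FB_USED{gpu="0",pod="llm-a"} 61440
'''

# metric_name{label="value",...} number
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return {metric_name: {gpu_id: value}} from Prometheus text format."""
    out = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # unlabelled or timestamped lines are ignored in this sketch
        labels = dict(LABEL.findall(match.group("labels")))
        out[match.group("name")][labels.get("gpu", "?")] = float(match.group("value"))
    return dict(out)


metrics = parse_metrics(SAMPLE)
print(metrics["DCGM_FI_DEV_GPU_UTIL"])
```

In practice Prometheus scrapes and stores these values for you; a parser like this is mainly useful for a quick sanity check that the exporter is emitting the fields you expect.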

Step 2 — Define Utilization Thresholds

Set alerts, not just dashboards. Suggested thresholds for inference workloads:

| Metric | Warning | Critical |
|---|---|---|
| SM Utilization (5-min avg) | < 50% | < 30% |
| Memory Bandwidth | < 60% | < 40% |
| Inter-request idle | > 150ms | > 300ms |
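The thresholds above translate directly into a small classifier. This is a minimal sketch: the metric keys (`sm_util_pct`, `mem_bw_pct`, `idle_ms`) are hypothetical names chosen for this example, and in production the same logic would more likely live in your alerting rules.

```python
# (warning, critical, direction) copied from the suggested thresholds above.
# "below" means lower values are worse; "above" means higher values are worse.
THRESHOLDS = {
    "sm_util_pct": (50.0, 30.0, "below"),
    "mem_bw_pct": (60.0, 40.0, "below"),
    "idle_ms": (150.0, 300.0, "above"),
}


def severity(metric: str, value: float) -> str:
    """Classify a reading as 'ok', 'warning', or 'critical'."""
    warning, critical, direction = THRESHOLDS[metric]
    if direction == "below":
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    else:
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    return "ok"


print(severity("sm_util_pct", 45))  # below 50 but not below 30 -> warning
print(severity("idle_ms", 320))     # above 300 -> critical
```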

Step 3 — Correlate with CPU Metrics

GPU underutilization is almost always caused by something upstream. Add CPU utilization, tokenization latency, and data pipeline throughput to the same dashboard. If CPU is pegged at 100% when your GPU is at 40%, you've found your bottleneck.
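One simple way to check for that pattern is to count how often a saturated CPU coincides with an underused GPU across paired samples. The sketch below assumes both series are sampled at the same timestamps; the cutoffs of 90% CPU and 50% GPU are placeholder values, not recommendations from DCGM or any vendor.

```python
def cpu_bound_fraction(cpu_pct, gpu_pct, cpu_high=90.0, gpu_low=50.0):
    """Share of paired samples where the CPU is saturated while the GPU is underused."""
    if len(cpu_pct) != len(gpu_pct):
        raise ValueError("cpu_pct and gpu_pct must have the same length")
    if not cpu_pct:
        return 0.0
    hits = sum(
        1 for cpu, gpu in zip(cpu_pct, gpu_pct)
        if cpu >= cpu_high and gpu <= gpu_low
    )
    return hits / len(cpu_pct)


# Made-up readings for illustration.
cpu = [98, 100, 97, 55, 99]
gpu = [40, 38, 42, 85, 41]
print(cpu_bound_fraction(cpu, gpu))  # 4 of 5 samples match -> 0.8
```

A high fraction over a sustained window points at preprocessing or data loading rather than the model itself.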

Step 4 — Profile at the Model Level

Use NVTX markers or vLLM's built-in profiling to identify which phases of inference are causing idle time: prefill, decode, KV cache eviction, or scheduling overhead.
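As a rough illustration of phase-level attribution, the sketch below accumulates wall-clock time per named phase using only the standard library. It is a stand-in, not NVTX itself: the `phase` helper and the `time.sleep` calls are hypothetical placeholders for real work. On a GPU host you could additionally call `torch.cuda.nvtx.range_push(name)` and `torch.cuda.nvtx.range_pop()` inside the helper so the same ranges show up in a profiler timeline. Keep in mind that CUDA kernel launches are asynchronous, so host-side timers only reflect GPU time if you synchronize before reading the clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)


@contextmanager
def phase(name):
    """Accumulate host-side wall-clock time spent inside a named phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start


# Hypothetical stand-ins for real prefill and decode work.
with phase("prefill"):
    time.sleep(0.02)
for _ in range(3):
    with phase("decode"):
        time.sleep(0.01)

total = sum(phase_seconds.values())
for name, seconds in phase_seconds.items():
    print(f"{name}: {seconds / total:.0%} of measured time")
```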

---

Common Causes and What to Do

| Root Cause | Signal | Fix |
|---|---|---|
| CPU-bound preprocessing | GPU idle, CPU high | Move tokenization to GPU or parallelize |
| Batch size too small | Low memory occupancy | Increase max batch size or use continuous batching |
| Serverless cold start | Idle spikes between requests | Pre-warm workers, tune scale-to-zero thresholds |
| Wrong GPU tier | Low SM util on high-memory GPU | Right-size to a smaller tier |
| KV cache pressure | High memory, low compute | Reduce context length or add KV cache offloading |

---

The Fleet-Level View

Detecting underutilization on a single GPU is one thing. At fleet scale, with multiple clusters, mixed providers, and dozens of models, the problem compounds. A model that leaves 40% of a 4×H100 node idle can burn roughly $12K/month in unused capacity, depending on the hourly rate you pay. Multiply that across a fleet and it becomes the largest line item nobody is tracking.
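The idle-cost estimate is straightforward arithmetic once you pick an hourly rate. The article does not state the rate behind its figure, so the $10 per GPU-hour below is an assumption chosen purely for illustration; substitute your own contract or on-demand price.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def idle_cost_per_month(num_gpus: int, hourly_rate_per_gpu: float,
                        idle_fraction: float) -> float:
    """Monthly spend attributable to idle GPU capacity."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return num_gpus * hourly_rate_per_gpu * HOURS_PER_MONTH * idle_fraction


# Assumed rate: $10 per GPU-hour is a placeholder, not a quoted price.
print(round(idle_cost_per_month(4, 10.0, 0.40)))  # 4 * 10 * 730 * 0.4 -> 11680
```

At that assumed rate, a 4-GPU node that is 40% idle comes out just under $12K per month, which is in line with the figure above.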

Fleet-level detection requires aggregating per-GPU telemetry into a control plane that surfaces waste by workload, cluster, and tier — not just individual node dashboards.

---

Next in the GPU Ops Field Guide: [OOM Root Cause for Inference Workloads →](/blog/gpu-ops-oom-root-cause)
