GPU Ops Field Guide

How to Detect GPU Underutilization in AI Inference Workloads

By Sam Hosseini·May 16, 2026·7 min read

GPU utilization percentage is the most-watched metric in AI infrastructure — and the most misleading. Here's what to measure instead, and how to instrument your fleet to catch waste before it compounds.

Why GPU Utilization % Lies to You

The number you see in nvidia-smi (say, 87% GPU utilization) reports the share of the sampling window in which at least one kernel was running. It tells you the GPU was doing something, not that it was doing useful work at full capacity. A GPU can report high utilization while:

Waiting on CPU preprocessing to finish
Sitting idle between inference requests in a serverless setup
Running at a fraction of its memory bandwidth capacity
Processing a batch size so small it barely exercises the hardware

True underutilization hides behind a healthy-looking number.
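
To make the gap concrete, here is a toy model, not NVIDIA's implementation. It assumes the headline figure behaves like "fraction of the window in which any kernel was running" and compares it with an average weighted by how many SMs each kernel occupied. The kernel spans are made-up data, and the 132-SM count assumes an H100 SXM.

```python
from dataclasses import dataclass


@dataclass
class KernelSpan:
    start_ms: float
    end_ms: float
    sms_busy: int  # SMs occupied while the kernel ran (hypothetical data)


def time_based_util(spans, window_ms):
    """Fraction of the window in which at least one kernel was running."""
    busy = 0.0
    cursor = 0.0  # end of the time already counted, so overlaps are not double-counted
    for s in sorted(spans, key=lambda s: s.start_ms):
        start = max(s.start_ms, cursor)
        if s.end_ms > start:
            busy += s.end_ms - start
            cursor = s.end_ms
    return busy / window_ms


def sm_weighted_util(spans, window_ms, total_sms):
    """Average fraction of SMs busy over the window (assumes spans do not overlap)."""
    sm_ms = sum((s.end_ms - s.start_ms) * s.sms_busy for s in spans)
    return sm_ms / (window_ms * total_sms)


# Two long kernels that each touch only a small slice of the GPU.
spans = [KernelSpan(0, 450, 20), KernelSpan(500, 920, 16)]
headline = time_based_util(spans, 1000)
weighted = sm_weighted_util(spans, 1000, 132)
print(f"headline: {headline:.0%}, SM-weighted: {weighted:.0%}")
```

In this sketch the headline number lands on 87% while the SM-weighted figure stays near 12%, which is the kind of gap the rest of this guide is about.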

---

The Four Metrics That Actually Matter

1. SM (Streaming Multiprocessor) Utilization

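This is the compute activity of the GPU cores themselves: the fraction of time the SMs had work assigned. NVIDIA DCGM exposes it as the profiling field DCGM_FI_PROF_SM_ACTIVE, reported as a ratio from 0 to 1. Note that DCGM_FI_DEV_GPU_UTIL is the same coarse figure nvidia-smi shows, not SM utilization. Anything consistently below 60% on an inference workload is a signal worth investigating.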

2. Memory Bandwidth Utilization

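The decode phase of LLM inference is usually memory-bandwidth-bound. If you're using less than 70% of available memory bandwidth, you're leaving throughput on the table. DCGM_FI_DEV_MEM_COPY_UTIL gives a coarse view (the share of time device memory was being read or written); the profiling field DCGM_FI_PROF_DRAM_ACTIVE is a closer measure of bandwidth actually used. Check this alongside SM utilization: a gap between the two shows which resource the workload is bound by, and when both are low the bottleneck is probably upstream, often the CPU.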
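
A back-of-envelope way to see what "memory-bandwidth-bound" implies is sketched below. It assumes single-sequence decode where every generated token requires reading all model weights once, and it ignores KV cache reads and batching, so treat the result as a rough ceiling rather than a prediction. The 3,350 GB/s figure is the published H100 SXM bandwidth; the 7B FP16 model and the 120 tokens/s measurement are hypothetical.

```python
def decode_tokens_per_sec_bound(params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough ceiling on decode speed when weight reads dominate memory traffic."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    return bandwidth_gb_s / weight_gb


def bandwidth_utilization(measured_tok_s, bound_tok_s):
    """Share of the bandwidth-implied ceiling that the workload actually reaches."""
    return measured_tok_s / bound_tok_s


# Assumed inputs: 7B parameters, 2 bytes per parameter (FP16), ~3.35 TB/s.
bound = decode_tokens_per_sec_bound(7, 2, 3350)
util = bandwidth_utilization(120, bound)  # 120 tok/s is a made-up measurement
print(f"ceiling: {bound:.0f} tok/s, utilization: {util:.0%}")
```

With these assumed numbers the ceiling is about 239 tokens/s, so a measured 120 tokens/s sits near 50%, below the 70% mark suggested above.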

3. GPU Memory Occupancy

High SM utilization with low memory occupancy often means your batch sizes are too small. The GPU is active but not saturated — you're paying for H100 capacity and getting A10G throughput.

4. Request Queue Depth + Inter-Request Idle Time

For inference specifically, the gap between requests is where utilization bleeds out. If your GPU is idle for 200ms between 50ms inference calls, it is busy for 50ms out of every 250ms: an effective utilization of 20%, regardless of what nvidia-smi shows.
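
The arithmetic behind that example can be sketched as a duty-cycle calculation. The function names and the request timestamps are illustrative; in practice the start and end times would come from your serving logs.

```python
def effective_utilization(busy_ms, idle_ms):
    """Fraction of wall-clock time the GPU spends serving requests."""
    return busy_ms / (busy_ms + idle_ms)


def inter_request_gaps(requests):
    """Idle time between consecutive requests.

    requests: list of (start_ms, end_ms) tuples, sorted by start, non-overlapping.
    """
    return [nxt[0] - cur[1] for cur, nxt in zip(requests, requests[1:])]


# 50ms calls separated by 200ms of idle time, as in the example above.
print(effective_utilization(50, 200))

# Hypothetical request log: three 50ms calls.
log = [(0, 50), (250, 300), (500, 550)]
gaps = inter_request_gaps(log)
print(gaps)
```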

---

How to Instrument for Detection

Step 1 — Enable DCGM

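NVIDIA's Data Center GPU Manager (DCGM) collects the metrics above, and dcgm-exporter publishes them in Prometheus format. If you're on Kubernetes, deploy dcgm-exporter as a DaemonSet. This gives you per-GPU, per-pod telemetry. The exporter's default collection interval is 30 seconds, so lower it if you need resolution closer to 1 second, and check that the profiling (DCGM_FI_PROF_*) fields you want are enabled in its metrics configuration.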

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
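
Once the exporter is running, each node serves plain Prometheus text, typically on port 9400 at /metrics. The sketch below parses a trimmed, illustrative sample of that text rather than calling a live endpoint; real output carries more labels (UUID, hostname, namespace, and so on), and the pod names here are invented. It is a minimal parser for this one line shape, not a full Prometheus client.

```python
import re

# Trimmed, illustrative sample of exporter output (not captured from a real node).
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="llm-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="llm-b"} 23
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",pod="llm-a"} 41
'''

LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Turn Prometheus text lines with labels into dicts; skip comments and blanks."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m is None:
            continue  # this sketch ignores lines without a label set
        rows.append({
            "name": m["name"],
            "labels": dict(LABEL.findall(m["labels"])),
            "value": float(m["value"]),
        })
    return rows


def by_gpu(rows, metric):
    """Map GPU index to the value of one metric."""
    return {r["labels"]["gpu"]: r["value"] for r in rows if r["name"] == metric}


rows = parse_metrics(SAMPLE)
print(by_gpu(rows, "DCGM_FI_DEV_GPU_UTIL"))
```

In a real deployment you would normally let Prometheus scrape the endpoint and query it there; parsing by hand is mainly useful for quick spot checks.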

Step 2 — Define Utilization Thresholds

Set alerts, not just dashboards. Suggested thresholds for inference workloads:

| Metric | Warning | Critical |
| --- | --- | --- |
| SM Utilization (5-min avg) | < 50% | < 30% |
| Memory Bandwidth Utilization | < 60% | < 40% |
| Inter-request idle time | > 150ms | > 300ms |
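
The thresholds in the table above translate directly into a small severity check. This is a sketch of the logic only: the metric keys are hypothetical names, and in production the same rules would more likely live in your alerting system than in application code.

```python
# (direction, warning, critical) per metric, taken from the table above.
THRESHOLDS = {
    "sm_util_pct": ("below", 50, 30),
    "mem_bw_pct": ("below", 60, 40),
    "inter_request_idle_ms": ("above", 150, 300),
}


def severity(metric, value):
    """Return 'ok', 'warning', or 'critical' for one metric reading."""
    direction, warn, crit = THRESHOLDS[metric]
    if direction == "below":
        if value < crit:
            return "critical"
        if value < warn:
            return "warning"
    else:
        if value > crit:
            return "critical"
        if value > warn:
            return "warning"
    return "ok"


# Hypothetical readings for one GPU.
readings = {"sm_util_pct": 45, "mem_bw_pct": 35, "inter_request_idle_ms": 120}
report = {name: severity(name, value) for name, value in readings.items()}
print(report)
```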

Step 3 — Correlate with CPU Metrics

GPU underutilization is usually caused by something upstream. Add CPU utilization, tokenization latency, and data pipeline throughput to the same dashboard so you can compare them on one time axis. If CPU is pegged at 100% while your GPU sits at 40%, you've very likely found your bottleneck.
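
One simple way to check that correlation is to count how often high CPU and low GPU readings coincide. The sketch below assumes you already have CPU and GPU samples aligned on the same timestamps; the cutoffs (95% and 40%) and the sample values are illustrative, not recommendations.

```python
def cpu_bound_fraction(samples, cpu_high=95.0, gpu_low=40.0):
    """Share of samples where the CPU is saturated while the GPU is mostly idle.

    samples: list of (cpu_pct, gpu_pct) pairs taken at the same timestamps.
    """
    if not samples:
        return 0.0
    hits = sum(1 for cpu, gpu in samples if cpu >= cpu_high and gpu <= gpu_low)
    return hits / len(samples)


# Hypothetical aligned samples: (CPU %, GPU %).
samples = [(99, 38), (100, 35), (97, 40), (60, 85), (98, 36)]
fraction = cpu_bound_fraction(samples)
print(f"{fraction:.0%} of samples look CPU-bound")
```

A high fraction suggests the GPU is waiting on host-side work; a low one points you toward batching, scheduling, or request arrival patterns instead.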

Step 4 — Profile at the Model Level

Use NVTX markers (which show up as named ranges in Nsight Systems) or vLLM's built-in profiling to identify which phases of inference are causing idle time: prefill, decode, KV cache eviction, or scheduling overhead.
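
As a rough illustration of phase-level marking, the sketch below wraps each stage of a request in a named range using the `nvtx` Python package when it is installed, and falls back to a no-op otherwise so the code still runs on a machine without it. The `phase` helper, the `handle_request` function, and the stand-in work inside each phase are hypothetical; a real server would wrap its actual tokenizer and model calls.

```python
import time
from contextlib import contextmanager, nullcontext

try:
    import nvtx  # NVIDIA's Python NVTX bindings, if installed

    def _range(name):
        return nvtx.annotate(name)
except ImportError:
    def _range(name):
        return nullcontext()  # no-op fallback when nvtx is unavailable

PHASE_SECONDS = {}  # wall-clock time accumulated per phase


@contextmanager
def phase(name):
    """Mark a named range for the profiler and record its wall-clock duration."""
    with _range(name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            PHASE_SECONDS[name] = PHASE_SECONDS.get(name, 0.0) + elapsed


def handle_request(prompt):
    with phase("tokenize"):
        tokens = prompt.split()  # stand-in for a real tokenizer
    with phase("prefill"):
        state = len(tokens)  # stand-in for the prefill forward pass
    with phase("decode"):
        output = [f"tok{i}" for i in range(state)]  # stand-in for decode steps
    return output


result = handle_request("why is my gpu idle")
print(result, sorted(PHASE_SECONDS))
```

The wall-clock totals give a quick first look; the named ranges only become visible on a timeline when the process is run under a profiler such as Nsight Systems.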

---

Common Causes and What to Do

| Root Cause | Signal | Fix |
| --- | --- | --- |
| CPU-bound preprocessing | GPU idle, CPU high | Move tokenization to GPU or parallelize |
| Batch size too small | Low memory occupancy | Increase max batch size or use continuous batching |
| Serverless cold start | Idle spikes between requests | Pre-warm workers, tune scale-to-zero thresholds |
| Wrong GPU tier | Low SM util on high-memory GPU | Right-size to a smaller tier |
| KV cache pressure | High memory, low compute | Reduce context length or add KV cache offloading |
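
The table above can be read as a set of triage rules. The sketch below encodes each row as a predicate over a snapshot of metrics and returns every cause whose signal matches. All field names and numeric cutoffs are placeholders chosen for illustration; the table gives qualitative signals only, so tune the numbers to your own workloads.

```python
# Each rule pairs a root cause from the table with a predicate on a metrics snapshot.
# Cutoffs are illustrative placeholders, not values from the table.
RULES = [
    ("CPU-bound preprocessing", lambda s: s["gpu_util"] < 50 and s["cpu_util"] >= 90),
    ("Batch size too small", lambda s: s["mem_occupancy"] < 40),
    ("Serverless cold start", lambda s: s["max_idle_gap_ms"] > 300),
    ("Wrong GPU tier", lambda s: s["gpu_util"] < 30 and s["gpu_memory_gb"] >= 80),
    ("KV cache pressure", lambda s: s["mem_occupancy"] > 90 and s["gpu_util"] < 50),
]


def triage(snapshot):
    """Return the candidate root causes whose signals match this snapshot."""
    return [cause for cause, rule in RULES if rule(snapshot)]


# Hypothetical snapshots (percentages unless the key says otherwise).
cpu_starved = {"gpu_util": 40, "cpu_util": 98, "mem_occupancy": 55,
               "max_idle_gap_ms": 20, "gpu_memory_gb": 24}
cache_bound = {"gpu_util": 35, "cpu_util": 30, "mem_occupancy": 95,
               "max_idle_gap_ms": 10, "gpu_memory_gb": 24}
print(triage(cpu_starved))
print(triage(cache_bound))
```

Several rules can fire at once, which is expected: the output is a list of candidates to investigate, not a diagnosis.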

---

The Fleet-Level View

Detecting underutilization on a single GPU is one thing. At fleet scale (multiple clusters, mixed providers, dozens of models) the problem compounds. A model underutilizing a 4xH100 node by 40% is burning roughly $12K/month in idle capacity, a figure that implies a rate of about $10 per GPU-hour. Multiply that across a fleet and it becomes the largest line item nobody is tracking.
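
The $12K/month figure can be reproduced with simple arithmetic. The sketch below assumes a 730-hour month and a price of $10.27 per GPU-hour; that price is back-derived from the figure above, not a quoted rate, so substitute your own.

```python
HOURS_PER_MONTH = 730  # average month length in hours


def idle_cost_per_month(gpus, price_per_gpu_hour, idle_fraction):
    """Monthly spend on GPU capacity that is paid for but not used."""
    return gpus * HOURS_PER_MONTH * price_per_gpu_hour * idle_fraction


# Assumed inputs: 4 GPUs, $10.27 per GPU-hour (back-derived), 40% idle.
cost = idle_cost_per_month(4, 10.27, 0.40)
print(f"${cost:,.0f} per month")

# The same node at an assumed $4.00 per GPU-hour.
cheaper = idle_cost_per_month(4, 4.00, 0.40)
print(f"${cheaper:,.0f} per month")
```

The waste scales linearly with price, so the same 40% idle fraction costs proportionally less on cheaper capacity, but it never goes away.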

Fleet-level detection requires aggregating per-GPU telemetry into a control plane that surfaces waste by workload, cluster, and tier — not just individual node dashboards.
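
At its simplest, that aggregation is a group-by over per-GPU records. The sketch below sums idle GPU-hours by an arbitrary dimension and sorts the largest sources of waste first. The record fields (`workload`, `cluster`, `hours`, `util`) and the sample data are hypothetical; a real control plane would build these records from the exporter telemetry.

```python
from collections import defaultdict


def idle_gpu_hours_by(records, key):
    """Sum idle GPU-hours per value of `key`, largest first.

    Each record needs `hours` (GPU-hours observed) and `util` (0 to 1).
    """
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["hours"] * (1 - r["util"])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical per-GPU records rolled up over some reporting period.
records = [
    {"workload": "chat", "cluster": "us-east", "hours": 100, "util": 0.75},
    {"workload": "chat", "cluster": "eu-west", "hours": 100, "util": 0.5},
    {"workload": "embed", "cluster": "us-east", "hours": 120, "util": 0.25},
]

by_workload = idle_gpu_hours_by(records, "workload")
by_cluster = idle_gpu_hours_by(records, "cluster")
print(by_workload)
print(by_cluster)
```

Slicing the same records by workload, cluster, or tier is what turns a pile of node dashboards into a ranked list of where to look first.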

See how Paralleliq surfaces underutilization across your inference fleet →

---

Next in the GPU Ops Field Guide: [OOM Root Cause for Inference Workloads →](/blog/gpu-ops-oom-root-cause)
