GPU Ops Field Guide

GPU Fleet Observability: What to Monitor and Why

By Sam Hosseini · May 16, 2026

A single GPU dashboard is not fleet observability. At scale, the metrics that matter are aggregated, correlated, and surfaced as actionable signals — not raw telemetry. Here's what to build.

Why Single-GPU Metrics Aren't Enough

Most GPU monitoring starts with nvidia-smi or a per-node Grafana dashboard. For a single GPU or a small cluster, that's sufficient. For a fleet — multiple clusters, mixed GPU tiers, dozens of models, multiple cloud providers — per-node metrics create more noise than signal.

The questions that matter at fleet scale are different:

Which models are underutilizing their GPU tier across the whole fleet?
Which clusters have systematic waste patterns vs. one-off anomalies?
What is the fleet-wide trend in KV cache pressure over the past 7 days?
Which provider is delivering the worst GPU-to-cost ratio this week?

Answering these requires aggregated, correlated observability — not more dashboards.

---

The Four Layers of GPU Fleet Observability

Layer 1 — Hardware Metrics (per GPU)

The foundation. Collected via NVIDIA DCGM and exposed through Prometheus.

Metric	Why It Matters
SM Utilization	Actual compute usage vs. capacity
Memory Bandwidth Utilization	Whether the GPU is memory-bandwidth-bound
VRAM Used / Free	Headroom before OOM or KV cache pressure
GPU Temperature	Thermal throttling risk
Power Draw	Cost correlation and thermal headroom
PCIe Throughput	Data transfer bottlenecks

Layer 2 — Inference Server Metrics (per model)

Collected from vLLM, TGI, SGLang, or Triton metrics endpoints.

Metric	Why It Matters
Request throughput (req/s)	Capacity vs. demand
Time to first token (TTFT)	Prefill efficiency
Inter-token latency	Decode efficiency
KV cache hit rate	Prefix caching effectiveness
Queue depth	Whether the server is keeping up
Batch size distribution	Continuous batching effectiveness

Layer 3 — Workload Metrics (per deployment)

Collected from your orchestration layer (Kubernetes, Ray, custom scheduler).

Metric	Why It Matters
Pod restart count	OOM or crash frequency
Replica count vs. traffic	Autoscaling efficiency
Request error rate	Model or infrastructure health
Cold start frequency	Scale-to-zero configuration effectiveness

Layer 4 — Fleet-Level Aggregations

This is the layer most teams are missing: aggregating Layers 1–3 across the whole fleet to answer fleet-scale questions.

Aggregation	Signal
Fleet-wide GPU utilization distribution	What % of GPUs are under 50% SM util?
Tier mismatch rate	How many models are on the wrong GPU tier?
Provider cost efficiency	Cost per useful GPU-hour by provider
KV cache pressure by model	Which models are cache-constrained?
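
As a rough illustration of two of these aggregations, the sketch below computes the share of GPUs under 50% SM utilization and a cost per useful GPU-hour by provider. The `GpuSample` type, the provider names, the prices, and the definition of a "useful" GPU-hour (one GPU-hour scaled by average SM utilization) are all assumptions made for this example, not definitions from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One GPU's averaged reading over a window (hypothetical schema)."""
    provider: str
    sm_util: float      # average SM utilization over the window, 0.0-1.0
    hourly_cost: float  # price paid per GPU-hour, in dollars


def share_under(samples, threshold=0.5):
    """Fraction of GPUs whose SM utilization is below the threshold."""
    if not samples:
        return 0.0
    return sum(s.sm_util < threshold for s in samples) / len(samples)


def cost_per_useful_gpu_hour(samples):
    """Per provider: hourly spend divided by utilization-weighted GPU-hours."""
    spend = defaultdict(float)
    useful = defaultdict(float)
    for s in samples:
        spend[s.provider] += s.hourly_cost
        useful[s.provider] += s.sm_util  # assumed: one GPU-hour scaled by util
    return {p: spend[p] / useful[p] for p in spend if useful[p] > 0}


# Made-up fleet: two providers, two GPUs each.
fleet = [
    GpuSample("provider-a", 0.80, 4.0),
    GpuSample("provider-a", 0.20, 4.0),
    GpuSample("provider-b", 0.50, 2.0),
    GpuSample("provider-b", 0.30, 2.0),
]
print(share_under(fleet))
print(cost_per_useful_gpu_hour(fleet))
```

In this made-up fleet, half the GPUs sit under 50% utilization, and provider-a costs more per useful GPU-hour than provider-b even though its busiest GPU is the best utilized. That kind of finding is hard to see on per-node dashboards.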

---

Instrumentation Stack

A practical fleet observability stack:

NVIDIA DCGM Exporter (per node)
    → Prometheus (metrics aggregation)
    → Grafana (dashboards)
    → Alertmanager (threshold alerts)

vLLM / TGI metrics endpoint (per model)
    → Prometheus

Kubernetes metrics (per pod/deployment)
    → kube-state-metrics → Prometheus

Fleet aggregation layer
    → Recording rules in Prometheus
    → Custom fleet dashboard in Grafana

The key is recording rules — pre-computed aggregations that answer fleet-scale questions without running expensive ad-hoc queries against raw telemetry.
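
To make the idea concrete, here is a minimal sketch that renders a Prometheus rule-file snippet with two recording rules. The rule names, the group name, and the `render_rule_group` helper are hypothetical. `gpu_sm_utilization` is the placeholder metric used in this article's alert examples (assumed to be on a 0.0-1.0 scale), so substitute whatever your exporter actually emits.

```python
import json

# Hypothetical rule names and expressions for illustration only.
FLEET_RULES = [
    ("cluster:gpu_sm_utilization:avg",
     "avg by (cluster) (gpu_sm_utilization)"),
    ("fleet:gpu_under_50pct:ratio",
     "count(gpu_sm_utilization < 0.5) / count(gpu_sm_utilization)"),
]


def render_rule_group(name, rules, interval="1m"):
    """Render one group of recording rules as rule-file YAML text."""
    lines = [
        "groups:",
        f"  - name: {name}",
        f"    interval: {interval}",
        "    rules:",
    ]
    for record, expr in rules:
        lines.append(f"      - record: {record}")
        # json.dumps produces a double-quoted string, which is also valid YAML.
        lines.append(f"        expr: {json.dumps(expr)}")
    return "\n".join(lines) + "\n"


print(render_rule_group("gpu_fleet_aggregations", FLEET_RULES))
```

Dashboards and alerts can then query the short precomputed series (for example `cluster:gpu_sm_utilization:avg`) instead of re-aggregating every per-GPU series on each refresh.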

---

Alert Design Principles

Most GPU alert setups generate too many alerts on transient spikes and miss the slow-burn patterns that actually cost money.

Alert on trends, not spikes:

# Bad: alerts on momentary spike
alert: HighGPUMemory
expr: gpu_memory_used_bytes > 0.9 * gpu_memory_total_bytes

# Better: alerts on sustained pressure
alert: SustainedGPUMemoryPressure
expr: avg_over_time(gpu_memory_used_ratio[15m]) > 0.88
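
The difference between the two rules can be sketched in a few lines of Python. This is a simplified model, assuming one memory-ratio sample per minute and made-up data; the function names are hypothetical, and the thresholds mirror the rules above.

```python
def instantaneous_alert(ratios, threshold=0.9):
    """Fires if any single sample exceeds the threshold."""
    return any(r > threshold for r in ratios)


def sustained_alert(ratios, window=15, threshold=0.88):
    """Fires if the mean of any `window` consecutive samples exceeds the threshold."""
    for end in range(window, len(ratios) + 1):
        chunk = ratios[end - window:end]
        if sum(chunk) / window > threshold:
            return True
    return False


# One sample per minute (hypothetical data).
spike = [0.60] * 20 + [0.95] + [0.60] * 20  # a single one-minute spike
pressure = [0.60] * 10 + [0.92] * 30        # 30 minutes of high usage

print(instantaneous_alert(spike), sustained_alert(spike))        # True False
print(instantaneous_alert(pressure), sustained_alert(pressure))  # True True
```

The one-minute spike trips the instantaneous check but barely moves the 15-minute average, while genuine sustained pressure trips both.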

Alert on fleet patterns, not individual nodes:

alert: FleetWideUnderutilization
expr: avg(gpu_sm_utilization) by (cluster) < 0.45
for: 30m
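
A rough Python analogue of this rule, assuming one evaluation per minute: average SM utilization per cluster, then fire only if the average stays below the threshold for 30 consecutive evaluations. The cluster names, the data, and both helper functions are hypothetical, and the streak counter only approximates how Prometheus tracks pending alerts.

```python
from statistics import mean


def cluster_averages(gpu_utils_by_cluster):
    """Mirror of `avg by (cluster)`: mean SM utilization per cluster."""
    return {c: mean(v) for c, v in gpu_utils_by_cluster.items()}


def firing_after(evaluations, threshold=0.45, for_steps=30):
    """Clusters whose average stayed below the threshold for the last
    `for_steps` consecutive evaluations (a rough analogue of `for:`)."""
    streak = {}
    for snapshot in evaluations:
        for cluster, avg in cluster_averages(snapshot).items():
            # Any evaluation at or above the threshold resets the streak.
            streak[cluster] = streak.get(cluster, 0) + 1 if avg < threshold else 0
    return sorted(c for c, n in streak.items() if n >= for_steps)


# Per-GPU SM utilization grouped by cluster (hypothetical data).
idle = {"us-east": [0.30, 0.40], "eu-west": [0.70, 0.80]}
busy = {"us-east": [0.60, 0.70], "eu-west": [0.70, 0.80]}

print(firing_after([idle] * 30))           # ['us-east']
print(firing_after([idle] * 29 + [busy]))  # []
```

A single busy evaluation resets the streak, which is why a long `for:` window suits slow-burn waste better than short-lived dips.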

Alert on cost signals, not just technical ones:

alert: ExpensiveTierUnderutilized
expr: gpu_sm_utilization{tier="h100"} < 0.35
for: 1h
annotations:
  summary: "H100 running below 35% SM util for 1 hour — possible tier mismatch"

---

The Visibility Gap at Scale

The most dangerous fleet observability failure mode isn't missing metrics — it's having metrics but no one looking at the right level. Per-node dashboards exist but fleet-level patterns go undetected for weeks.

The discipline of fleet observability is designing the system so that the signals that matter (tier mismatches, systematic waste, KV cache pressure trends) surface automatically as actionable findings instead of staying buried in dashboards that require human interpretation.
