ParallelIQ
GPU Ops Field Guide

GPU Fleet Observability: What to Monitor and Why

By Sam Hosseini · May 16, 2026 · 7 min read

A single GPU dashboard is not fleet observability. At scale, the metrics that matter are aggregated, correlated, and surfaced as actionable signals — not raw telemetry. Here's what to build.

Why Single-GPU Metrics Aren't Enough

Most GPU monitoring starts with nvidia-smi or a per-node Grafana dashboard. For a single GPU or a small cluster, that's sufficient. For a fleet — multiple clusters, mixed GPU tiers, dozens of models, multiple cloud providers — per-node metrics create more noise than signal.

The questions that matter at fleet scale are different:

  • Which models are underutilizing their GPU tier across the whole fleet?
  • Which clusters have systematic waste patterns vs. one-off anomalies?
  • What is the fleet-wide trend in KV cache pressure over the past 7 days?
  • Which provider is delivering the worst GPU-to-cost ratio this week?

Answering these requires aggregated, correlated observability — not more dashboards.

---

The Four Layers of GPU Fleet Observability

Layer 1 — Hardware Metrics (per GPU)

The foundation. Collected via NVIDIA DCGM and exposed through Prometheus.

| Metric | Why It Matters |
|---|---|
| SM Utilization | Actual compute usage vs. capacity |
| Memory Bandwidth Utilization | Whether the GPU is memory-bandwidth-bound |
| VRAM Used / Free | Headroom before OOM or KV cache pressure |
| GPU Temperature | Thermal throttling risk |
| Power Draw | Cost correlation and thermal headroom |
| PCIe Throughput | Data transfer bottlenecks |
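
As a rough sketch of how these per-GPU metrics might reach Prometheus, the scrape job below targets dcgm-exporter on its default port (9400). The hostnames and the `cluster`, `provider`, and `tier` label values are hypothetical placeholders, and attaching those labels at scrape time is an assumption made here so that later aggregations have fleet dimensions to group by.

```yaml
# prometheus.yml (fragment): scrape dcgm-exporter on each GPU node.
# Hostnames and label values are hypothetical placeholders.
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets:
          - gpu-node-01.example.internal:9400   # dcgm-exporter default port
          - gpu-node-02.example.internal:9400
        labels:
          cluster: us-east-a    # assumed fleet dimensions, attached at scrape time
          provider: cloud-a
          tier: h100
```

In Kubernetes the static target list would typically be replaced by service discovery. The point of the sketch is only that every GPU series carries the labels the fleet-level queries will later group by.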

Layer 2 — Inference Server Metrics (per model)

Collected from vLLM, TGI, SGLang, or Triton metrics endpoints.

| Metric | Why It Matters |
|---|---|
| Request throughput (req/s) | Capacity vs. demand |
| Time to first token (TTFT) | Prefill efficiency |
| Inter-token latency | Decode efficiency |
| KV cache hit rate | Prefix caching effectiveness |
| Queue depth | Whether the server is keeping up |
| Batch size distribution | Continuous batching effectiveness |
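
Two of these per-model metrics could be derived with rules along the following lines. This is a minimal sketch that assumes a vLLM server exposing the `vllm:time_to_first_token_seconds` histogram and the `vllm:num_requests_waiting` gauge with a `model_name` label. Metric names differ between inference servers and between vLLM versions, so check the server's own metrics endpoint before relying on them. The `record` names are illustrative.

```yaml
groups:
  - name: inference_server_per_model
    rules:
      # p95 time to first token per model over the last 5 minutes.
      - record: model:ttft_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
          )
      # Requests waiting in the queue, summed across replicas of each model.
      - record: model:requests_waiting:sum
        expr: sum by (model_name) (vllm:num_requests_waiting)
```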

Layer 3 — Workload Metrics (per deployment)

Collected from your orchestration layer (Kubernetes, Ray, custom scheduler).

| Metric | Why It Matters |
|---|---|
| Pod restart count | OOM or crash frequency |
| Replica count vs. traffic | Autoscaling efficiency |
| Request error rate | Model or infrastructure health |
| Cold start frequency | Scale-to-zero configuration effectiveness |
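
For Kubernetes, restart and OOM frequency can be approximated from kube-state-metrics. The sketch below groups by `namespace` because kube-state-metrics pod series do not carry a deployment label. Treating one namespace as one deployment is an assumption, and the `record` names are illustrative.

```yaml
groups:
  - name: workload_per_namespace
    rules:
      # Container restarts per namespace over the last hour.
      - record: namespace:container_restarts:increase1h
        expr: sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h]))
      # Containers whose most recent termination was an OOM kill.
      - record: namespace:containers_last_oomkilled:sum
        expr: |
          sum by (namespace) (
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
          )
```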

Layer 4 — Fleet-Level Aggregations

This is the layer most teams are missing. It aggregates Layers 1–3 across the whole fleet to answer fleet-scale questions.

| Aggregation | Signal |
|---|---|
| Fleet-wide GPU utilization distribution | What % of GPUs are under 50% SM util? |
| Tier mismatch rate | How many models are on the wrong GPU tier? |
| Provider cost efficiency | Cost per useful GPU-hour by provider |
| KV cache pressure by model | Which models are cache-constrained? |
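
Two of these aggregations can be sketched in PromQL as follows. The sketch reuses the placeholder metric `gpu_sm_utilization` (a 0 to 1 ratio) from the alert examples in this article and assumes a hypothetical per-GPU gauge `gpu_hourly_cost_usd`, with both series carrying a `provider` label. Defining "useful" as SM-utilization-weighted is also an assumption, and teams may prefer a different definition.

```yaml
groups:
  - name: fleet_aggregations
    rules:
      # Share of GPUs whose 1-hour average SM utilization is below 50%.
      # "or vector(0)" keeps the result at 0 instead of empty when no GPU qualifies.
      - record: fleet:gpu_under_50pct_sm_util:ratio
        expr: |
          (count(avg_over_time(gpu_sm_utilization[1h]) < 0.5) or vector(0))
          /
          count(avg_over_time(gpu_sm_utilization[1h]))
      # Dollars per utilization-weighted GPU-hour, per provider.
      - record: provider:cost_per_useful_gpu_hour_usd:ratio
        expr: |
          sum by (provider) (gpu_hourly_cost_usd)
          /
          sum by (provider) (avg_over_time(gpu_sm_utilization[1h]))
```

Averaging over an hour before applying the threshold keeps a GPU that is briefly idle between batches from being counted as underutilized.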

---

Instrumentation Stack

A practical fleet observability stack:

NVIDIA DCGM Exporter (per node)
    → Prometheus (metrics aggregation)
    → Grafana (dashboards)
    → Alertmanager (threshold alerts)

vLLM / TGI metrics endpoint (per model)
    → Prometheus

Kubernetes metrics (per pod/deployment)
    → kube-state-metrics → Prometheus

Fleet aggregation layer
    → Recording rules in Prometheus
    → Custom fleet dashboard in Grafana

The key is recording rules — pre-computed aggregations that answer fleet-scale questions without running expensive ad-hoc queries against raw telemetry.
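
A recording rule file might look like the sketch below. The source metric names (`gpu_sm_utilization`, `kv_cache_usage_ratio`) and the `model_name` label are placeholders rather than names emitted by any specific exporter. The `record` names follow the Prometheus `level:metric:operations` naming convention.

```yaml
groups:
  - name: fleet_recording_rules
    interval: 1m
    rules:
      # Average SM utilization per cluster, pre-computed every minute.
      - record: cluster:gpu_sm_utilization:avg
        expr: avg by (cluster) (gpu_sm_utilization)
      # Average KV cache usage per model across all replicas.
      - record: model:kv_cache_usage_ratio:avg
        expr: avg by (model_name) (kv_cache_usage_ratio)
      # 7-day rolling mean of the per-model series above.
      - record: model:kv_cache_usage_ratio:avg_7d
        expr: avg_over_time(model:kv_cache_usage_ratio:avg[7d])
```

Rules within a group are evaluated in order, so the third rule can build on the series recorded by the second. The 7-day window only yields a full result once Prometheus has retained at least 7 days of the recorded series.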

---

Alert Design Principles

Most GPU alert setups generate too many alerts on transient spikes and miss the slow-burn patterns that actually cost money.

Alert on trends, not spikes:

# Bad: alerts on momentary spike
alert: HighGPUMemory
expr: gpu_memory_used_bytes > 0.9 * gpu_memory_total_bytes

# Better: alerts on sustained pressure
alert: SustainedGPUMemoryPressure
expr: avg_over_time(gpu_memory_used_ratio[15m]) > 0.88

Alert on fleet patterns, not individual nodes:

alert: FleetWideUnderutilization
expr: avg(gpu_sm_utilization) by (cluster) < 0.45
for: 30m

Alert on cost signals, not just technical ones:

alert: ExpensiveTierUnderutilized
expr: gpu_sm_utilization{tier="h100"} < 0.35
for: 1h
annotations:
  summary: "H100 running below 35% SM util for 1 hour — possible tier mismatch"
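
The snippets above are fragments. In a loadable rule file, each alert sits under a `groups` and `rules` hierarchy, as in this minimal sketch. The file name, the group name, and the `severity` label are assumptions, and `gpu_sm_utilization` remains a placeholder metric.

```yaml
# gpu-alerts.rules.yml (hypothetical file name)
groups:
  - name: gpu_fleet_alerts
    rules:
      - alert: ExpensiveTierUnderutilized
        expr: gpu_sm_utilization{tier="h100"} < 0.35
        for: 1h
        labels:
          severity: warning    # assumed routing label for Alertmanager
        annotations:
          summary: "H100 {{ $labels.instance }} below 35% SM util for 1 hour: possible tier mismatch"
```

Running `promtool check rules gpu-alerts.rules.yml` validates the file before Prometheus loads it.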

---

The Visibility Gap at Scale

The most dangerous fleet observability failure mode isn't missing metrics — it's having metrics but no one looking at the right level. Per-node dashboards exist but fleet-level patterns go undetected for weeks.

The discipline of fleet observability is designing the system so that the signals that matter (tier mismatches, systematic waste, KV cache pressure trends) surface automatically as actionable findings instead of staying buried in dashboards that require human interpretation.

See how ParallelIQ aggregates fleet-level GPU observability into actionable findings →

---

Next in the GPU Ops Field Guide: [Serverless GPU Cold Start Latency: Causes and Solutions →](/blog/gpu-ops-serverless-cold-start)
