GPU Ops Field Guide

Multi-Cluster GPU Visibility Across Providers

By Sam Hosseini·May 16, 2026·7 min read

Most AI teams operate GPU infrastructure across multiple clusters, clouds, and providers. Getting a unified view of fleet health, cost, and utilization across all of them is one of the hardest operational problems at scale.

The Multi-Cluster Reality

AI infrastructure rarely lives in one place. A typical mid-scale AI team might run:

A primary cluster on a GPU cloud provider (CoreWeave, Lambda, Vast.ai)
Reserved capacity on AWS or GCP for burst workloads
On-premises bare metal for sensitive or regulated workloads
Development clusters on smaller, cheaper GPU instances

Each of these has its own monitoring stack, its own metrics format, its own access controls, and its own cost model. Getting a unified view across all of them requires deliberate architecture — and most teams never build it.

The result: GPU waste and performance problems that are invisible at the fleet level even when they're obvious on any individual cluster.

---

What Breaks at Multi-Cluster Scale

Metric silos. Each cluster runs its own Prometheus instance. Comparing GPU utilization across clusters requires either federated Prometheus (complex) or a centralized metrics store (requires data pipeline work). Most teams end up with n separate dashboards and no fleet-level view.

Cost fragmentation. GPU costs from three providers arrive in three different billing formats with different dimensions, different pricing models (spot vs. reserved vs. on-demand), and different time zones. Building a unified cost view requires normalization work that most teams defer indefinitely.

Model inventory gaps. Which models are running where? At fleet scale, models get deployed across clusters and the inventory becomes stale within days. No single source of truth means you can't answer basic questions: "How many replicas of llama-70b are we running right now, across all clusters?"

Alert duplication and gaps. Each cluster has its own alerting configuration, so a single correlated event fires the same alert several times. Meanwhile, fleet-level patterns, such as systematic underutilization across all clusters, generate no alert because no system is looking at the aggregate.
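The duplication half of that problem can be sketched in a few lines. This is a minimal illustration, not a replacement for a real alert router: the alert dictionaries, their field names (`name`, `cluster`, `fired_at`), and the five-minute window are all assumptions made for the example.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], window_seconds: int = 300) -> dict[tuple[str, int], list[str]]:
    """Group alerts by name and fixed time bucket, listing the clusters that fired each one."""
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for alert in alerts:
        # Integer division puts alerts fired within the same window into the same bucket.
        bucket = alert["fired_at"] // window_seconds
        groups[(alert["name"], bucket)].append(alert["cluster"])
    return dict(groups)


# Illustrative alerts; the alert names and timestamps are made up.
alerts = [
    {"name": "GpuUnderutilized", "cluster": "coreweave-us-east", "fired_at": 1000},
    {"name": "GpuUnderutilized", "cluster": "aws-us-east-1", "fired_at": 1100},
    {"name": "HighTTFT", "cluster": "aws-us-east-1", "fired_at": 5000},
]
grouped = group_alerts(alerts)
```

Fixed buckets are the simplest choice, but two alerts that straddle a bucket boundary will not be grouped; a sliding window would handle that at the cost of more state.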

---

Architecture for Multi-Cluster Visibility

Layer 1 — Standardized telemetry collection

Deploy the same telemetry stack on every cluster, regardless of provider:

NVIDIA DCGM Exporter (per node)
Inference server metrics (vLLM, TGI, SGLang)
Kubernetes kube-state-metrics

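Each cluster then emits a Prometheus-compatible metrics stream with a consistent label schema. DCGM Exporter does not attach labels such as cluster or provider on its own, so they are typically added through Prometheus external labels or relabeling rules: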

gpu_sm_utilization{cluster="coreweave-us-east", provider="coreweave", tier="h100", model="llama-70b"}
gpu_sm_utilization{cluster="aws-us-east-1", provider="aws", tier="a100", model="mistral-7b"}

Consistent labeling is the prerequisite for fleet-level aggregation. Without it, you can't join metrics across clusters.
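One way to keep the schema from drifting is to audit series labels before they reach the central store. The sketch below assumes the required label set is exactly the four labels shown in the example series; `REQUIRED_LABELS`, `missing_labels`, and `audit_series` are hypothetical names, not part of any exporter.

```python
# Assumed required label set, taken from the example series above.
REQUIRED_LABELS = {"cluster", "provider", "tier", "model"}


def missing_labels(series_labels: dict[str, str]) -> set[str]:
    """Return the required labels that are absent or empty on one series."""
    return {name for name in REQUIRED_LABELS if not series_labels.get(name)}


def audit_series(all_series: list[dict[str, str]]) -> list[tuple[dict[str, str], set[str]]]:
    """Return (labels, missing) pairs for every series that fails the schema."""
    failures = []
    for labels in all_series:
        missing = missing_labels(labels)
        if missing:
            failures.append((labels, missing))
    return failures


series = [
    {"cluster": "coreweave-us-east", "provider": "coreweave", "tier": "h100", "model": "llama-70b"},
    {"cluster": "aws-us-east-1", "provider": "aws", "tier": "a100"},  # model label is missing
]
failures = audit_series(series)
```

Running a check like this in CI against each cluster's scrape output catches a missing label before it silently drops that cluster out of a fleet-level join.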

Layer 2 — Centralized metrics aggregation

Options for aggregating metrics across clusters:

Approach	Complexity	Best For
Prometheus Federation	Medium	Small number of clusters
Thanos / Cortex	High	Large-scale, long retention
Grafana Cloud / Datadog	Low	Teams that prefer managed services
VictoriaMetrics	Medium	High-cardinality, cost-sensitive

The centralized store becomes the single source of truth for fleet-level queries.
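As a rough sketch of what a fleet-level query against that store looks like, the code below builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and flattens an instant-vector response into a per-cluster mapping. The base URL, the metric name `gpu_sm_utilization`, and the sample response values are assumptions; whether your chosen backend serves the same endpoint should be checked against its documentation.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_label(response: dict, label: str) -> dict[str, float]:
    """Map one label's value to the sample value for an instant-vector result."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    out = {}
    for item in response["data"]["result"]:
        # Each vector sample is {"metric": {labels}, "value": [timestamp, "string value"]}.
        out[item["metric"][label]] = float(item["value"][1])
    return out


# Hand-written response in the instant-vector shape; the numbers are illustrative.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "coreweave-us-east"}, "value": [1747400400, "71.5"]},
            {"metric": {"cluster": "aws-us-east-1"}, "value": [1747400400, "38.0"]},
        ],
    },
}

# Hypothetical endpoint for the centralized store.
url = instant_query_url(
    "http://metrics.example.internal:9090",
    "avg by (cluster) (gpu_sm_utilization)",
)
per_cluster = values_by_label(sample_response, "cluster")
```

The HTTP call itself is left out so the sketch stays runnable offline; any HTTP client can fetch `url` and pass the decoded JSON to `values_by_label`.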

Layer 3 — Normalized cost data

Normalize GPU cost data across providers into a common schema:

{
  "cluster": "coreweave-us-east",
  "provider": "coreweave",
  "gpu_type": "h100",
  "billing_model": "reserved",
  "cost_per_gpu_hour": 2.49,
  "currency": "USD"
}

Join this with utilization metrics to produce cost efficiency metrics: cost per useful GPU-hour by cluster, provider, and model.
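The join can be sketched as follows. The article does not define "useful GPU-hour", so this example assumes it means billed GPU-hours weighted by average utilization; the `ClusterCost` type and its `gpu_hours` and `avg_utilization` fields are hypothetical additions to the cost schema above.

```python
from dataclasses import dataclass


@dataclass
class ClusterCost:
    cluster: str
    provider: str
    cost_per_gpu_hour: float  # USD, from the normalized cost record
    gpu_hours: float          # billed GPU-hours in the period (assumed field)
    avg_utilization: float    # mean utilization over the period, 0.0 to 1.0 (assumed field)


def cost_per_useful_gpu_hour(c: ClusterCost) -> float:
    """Total spend divided by utilization-weighted GPU-hours."""
    useful_hours = c.gpu_hours * c.avg_utilization
    if useful_hours == 0:
        return float("inf")  # capacity was paid for but did no work
    return (c.cost_per_gpu_hour * c.gpu_hours) / useful_hours


example = ClusterCost("coreweave-us-east", "coreweave", 2.49, 1000.0, 0.5)
result = cost_per_useful_gpu_hour(example)
```

At 50% utilization the effective price is double the list price, roughly $4.98 per useful GPU-hour here, which is the kind of gap this metric is meant to expose.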

Layer 4 — Fleet-level model inventory

Maintain a registry of what's running where. This can be as simple as a Kubernetes ConfigMap updated by your CI/CD pipeline, or as sophisticated as a dedicated model registry.

The minimum viable inventory:

- model: llama-70b
  clusters:
    - name: coreweave-us-east
      replicas: 4
      tier: h100
    - name: aws-us-east-1
      replicas: 2
      tier: a100-80gb
  total_replicas: 6
  last_updated: 2026-05-16T14:00:00Z
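Because the inventory goes stale quickly, it helps to check it automatically. The sketch below mirrors the entry above as a Python dictionary and flags two problems: a `total_replicas` value that disagrees with the per-cluster sum, and an entry older than a chosen maximum age. The function names and the one-day threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def replica_total(entry: dict) -> int:
    """Sum replicas across the clusters listed for one model."""
    return sum(c["replicas"] for c in entry["clusters"])


def inventory_problems(entry: dict, now: datetime, max_age: timedelta) -> list[str]:
    """Return human-readable problems found in one inventory entry."""
    problems = []
    if entry["total_replicas"] != replica_total(entry):
        problems.append("total_replicas does not match per-cluster sum")
    if now - entry["last_updated"] > max_age:
        problems.append("entry is stale")
    return problems


entry = {
    "model": "llama-70b",
    "clusters": [
        {"name": "coreweave-us-east", "replicas": 4, "tier": "h100"},
        {"name": "aws-us-east-1", "replicas": 2, "tier": "a100-80gb"},
    ],
    "total_replicas": 6,
    "last_updated": datetime(2026, 5, 16, 14, 0, tzinfo=timezone.utc),
}

# Checked six hours after the update, then four days after it.
fresh = inventory_problems(entry, datetime(2026, 5, 16, 20, 0, tzinfo=timezone.utc), timedelta(days=1))
stale = inventory_problems(entry, datetime(2026, 5, 20, 14, 0, tzinfo=timezone.utc), timedelta(days=1))
```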

---

Fleet-Level Metrics That Matter

Once multi-cluster telemetry is unified, these are the aggregations that surface actionable insights:

Metric	Query Pattern	Action Threshold
Fleet GPU utilization	avg by (cluster) (gpu_sm_utilization)	< 45% on any cluster
Provider cost efficiency	cost / useful_gpu_hours by provider	20% worse than fleet avg
Tier mismatch rate	count(gpu_sm_utilization < 40%) / count(gpu_sm_utilization)	> 15% of fleet
Cross-cluster latency variance	stddev by (model) (ttft)	High variance suggests a routing problem
Model coverage	running models absent from the inventory	Any gap = shadow deployment
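Two of these thresholds can be expressed directly over per-GPU utilization samples. This sketch assumes utilization is reported in percent (0 to 100) and that each sample is a `(cluster, utilization)` pair; the constant and function names are hypothetical.

```python
from statistics import mean

# Thresholds copied from the table above; utilization is assumed to be in percent.
CLUSTER_UTIL_FLOOR = 45.0
GPU_UTIL_FLOOR = 40.0
MISMATCH_RATE_CEILING = 0.15


def underutilized_clusters(samples: list[tuple[str, float]]) -> list[str]:
    """Clusters whose mean per-GPU utilization is below the cluster floor."""
    by_cluster: dict[str, list[float]] = {}
    for cluster, util in samples:
        by_cluster.setdefault(cluster, []).append(util)
    return sorted(c for c, vals in by_cluster.items() if mean(vals) < CLUSTER_UTIL_FLOOR)


def tier_mismatch_rate(samples: list[tuple[str, float]]) -> float:
    """Fraction of GPUs fleet-wide running below the per-GPU floor."""
    if not samples:
        return 0.0
    low = sum(1 for _, util in samples if util < GPU_UTIL_FLOOR)
    return low / len(samples)


# Illustrative samples, one per GPU.
samples = [
    ("coreweave-us-east", 82.0),
    ("coreweave-us-east", 76.0),
    ("aws-us-east-1", 35.0),
    ("aws-us-east-1", 50.0),
]
flagged = underutilized_clusters(samples)
rate = tier_mismatch_rate(samples)
over_limit = rate > MISMATCH_RATE_CEILING
```

In practice the same logic would more likely live in recording and alerting rules on the central store; the Python form just makes the arithmetic explicit.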

---

Operational Patterns

Pattern 1 — Unified on-call runbook

A single runbook that works across all clusters, regardless of provider. Operators don't need to know which provider a cluster runs on to diagnose an issue — the telemetry is normalized.

Pattern 2 — Cross-cluster autoscaling

When one cluster is at capacity, route overflow to another. This requires fleet-level visibility into which clusters have headroom — which only works if telemetry is unified.
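A minimal sketch of the headroom decision, assuming fleet-level utilization is already available as a fraction per cluster. The 80% ceiling, the function name, and the `onprem-dc1` cluster are illustrative assumptions; a real router would also weigh GPU tier, data locality, and cost.

```python
from typing import Optional


def pick_overflow_cluster(
    utilization: dict[str, float],
    source: str,
    max_target_util: float = 0.80,
) -> Optional[str]:
    """Pick the least-utilized cluster other than `source` that still has headroom."""
    candidates = {
        name: util
        for name, util in utilization.items()
        if name != source and util < max_target_util
    }
    if not candidates:
        return None  # no cluster has headroom; the caller must queue or shed load
    return min(candidates, key=candidates.get)


# Illustrative fleet snapshot, utilization as a fraction of capacity.
fleet = {"coreweave-us-east": 0.97, "aws-us-east-1": 0.55, "onprem-dc1": 0.70}
target = pick_overflow_cluster(fleet, source="coreweave-us-east")
```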

Pattern 3 — Provider benchmarking

With normalized cost and utilization data, you can measure which provider delivers the lowest cost per useful GPU-hour for each workload type. This informs future capacity decisions with data rather than intuition.
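The comparison can be sketched as below, using the "20% worse than fleet average" threshold from the metrics table. It assumes the fleet average is total spend divided by total useful GPU-hours and that every provider has a positive useful-hours figure; the function name and the sample numbers are made up for illustration.

```python
def providers_worse_than_fleet(
    spend_usd: dict[str, float],
    useful_gpu_hours: dict[str, float],
    tolerance: float = 0.20,
) -> list[str]:
    """Providers whose cost per useful GPU-hour exceeds the fleet average by more than `tolerance`."""
    fleet_avg = sum(spend_usd.values()) / sum(useful_gpu_hours.values())
    limit = fleet_avg * (1 + tolerance)
    return sorted(
        provider
        for provider in spend_usd
        if spend_usd[provider] / useful_gpu_hours[provider] > limit
    )


# Illustrative monthly figures per provider.
spend = {"coreweave": 3000.0, "aws": 4000.0, "onprem": 1000.0}
useful = {"coreweave": 1000.0, "aws": 800.0, "onprem": 500.0}
outliers = providers_worse_than_fleet(spend, useful)
```

Splitting the inputs by workload type before calling the function gives the per-workload view the pattern describes.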

---

The Visibility Baseline

The minimum viable multi-cluster visibility setup:

Standardized DCGM + inference server metrics on every cluster
Consistent label schema across all clusters
Centralized Prometheus or Thanos instance
Single Grafana fleet dashboard with cluster-level drill-down
Fleet-level alerts on aggregated signals, not per-cluster noise

This baseline typically takes 1–2 weeks to build and can pay for itself within the first month by surfacing waste patterns that were previously invisible.


---

This concludes the GPU Ops Field Guide — 10 articles covering the core operational challenges of LLM inference infrastructure. [Start from Article #1 →](/blog/gpu-ops-detect-underutilization)
