Multi-Cluster GPU Visibility Across Providers

Most AI teams operate GPU infrastructure across multiple clusters, clouds, and providers. Getting a unified view of fleet health, cost, and utilization across all of them is one of the hardest operational problems at scale.
The Multi-Cluster Reality
AI infrastructure rarely lives in one place. A typical mid-scale AI team might run:
- A primary cluster on a GPU cloud provider (CoreWeave, Lambda, Vast.ai)
- Reserved capacity on AWS or GCP for burst workloads
- On-premises bare metal for sensitive or regulated workloads
- Development clusters on smaller, cheaper GPU instances
Each of these has its own monitoring stack, its own metrics format, its own access controls, and its own cost model. Getting a unified view across all of them requires deliberate architecture — and most teams never build it.
The result: GPU waste and performance problems that are invisible at the fleet level even when they're obvious on any individual cluster.
---
What Breaks at Multi-Cluster Scale
Metric silos: Each cluster runs its own Prometheus instance. Comparing GPU utilization across clusters requires either Prometheus federation (complex to operate) or a centralized metrics store (which needs data pipeline work). Most teams end up with one dashboard per cluster and no fleet-level view.
Cost fragmentation: GPU costs from three providers arrive in three different billing formats, with different dimensions, different pricing models (spot vs. reserved vs. on-demand), and different time zones. Building a unified cost view requires normalization work that most teams defer indefinitely.
Model inventory gaps: Which models are running where? At fleet scale, models get deployed across clusters and the inventory goes stale within days. Without a single source of truth, you can't answer basic questions such as "How many replicas of llama-70b are we running right now, across all clusters?"
Alert duplication and gaps: Each cluster has its own alerting configuration, so correlated events fire the same alert several times. Meanwhile, fleet-level patterns such as systematic underutilization across all clusters generate no alert, because no system is looking at the aggregate.
---
Architecture for Multi-Cluster Visibility
Layer 1 — Standardized telemetry collection
Deploy the same telemetry stack on every cluster, regardless of provider:
- NVIDIA DCGM Exporter (per node)
- Inference server metrics (vLLM, TGI, SGLang)
- Kubernetes kube-state-metrics
The output of each cluster is a Prometheus-compatible metrics stream with consistent label schemas:
gpu_sm_utilization{cluster="coreweave-us-east", provider="coreweave", tier="h100", model="llama-70b"}
gpu_sm_utilization{cluster="aws-us-east-1", provider="aws", tier="a100", model="mistral-7b"}
Consistent labeling is the prerequisite for fleet-level aggregation. Without it, you can't join metrics across clusters.
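One way to keep the label schema from drifting is to check samples against the required label set before they reach the central store. The following is a minimal sketch, assuming the four labels shown above are the required schema; the function name `missing_labels` and the sample data are hypothetical.

```python
# Required label set, taken from the example series above (assumed to be the full schema).
REQUIRED_LABELS = {"cluster", "provider", "tier", "model"}


def missing_labels(sample_labels: dict[str, str]) -> set[str]:
    """Return the required labels that are absent or empty on one sample."""
    return {name for name in REQUIRED_LABELS if not sample_labels.get(name)}


# Hypothetical label sets as they might be scraped from two clusters.
samples = [
    {"cluster": "coreweave-us-east", "provider": "coreweave", "tier": "h100", "model": "llama-70b"},
    {"cluster": "aws-us-east-1", "provider": "aws", "tier": "a100"},  # model label missing
]

for labels in samples:
    gaps = missing_labels(labels)
    if gaps:
        print(f"{labels.get('cluster', '<unknown>')}: missing {sorted(gaps)}")
```

A check like this could run in CI against each cluster's scrape configuration, so a schema gap is caught before it breaks a fleet-level join.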
Layer 2 — Centralized metrics aggregation
Options for aggregating metrics across clusters:
| Approach | Complexity | Best For |
|---|---|---|
| Prometheus Federation | Medium | Small number of clusters |
| Thanos / Cortex | High | Large-scale, long retention |
| Grafana Cloud / Datadog | Low | Teams that prefer managed services |
| VictoriaMetrics | Medium | High-cardinality, cost-sensitive |
The centralized store becomes the single source of truth for fleet-level queries.
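Whichever store you pick, fleet-level queries typically come back in the Prometheus instant-query response format. The sketch below parses such a response into a per-cluster mapping. It assumes the store exposes a Prometheus-compatible query API; the canned response, its values, and the function name `parse_instant_vector` are illustrative stand-ins for a real HTTP call.

```python
import json


def parse_instant_vector(body: str, label: str = "cluster") -> dict[str, float]:
    """Map one label's value to the sample value for each series in the response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each series carries its labels in "metric" and a [timestamp, "value"] pair in "value".
    return {series["metric"][label]: float(series["value"][1]) for series in data["result"]}


# Canned response standing in for the central store's reply (hypothetical values).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "coreweave-us-east"}, "value": [1747404000, "68"]},
            {"metric": {"cluster": "aws-us-east-1"}, "value": [1747404000, "38"]},
        ],
    },
})

fleet_utilization = parse_instant_vector(SAMPLE_RESPONSE)
print(fleet_utilization)
```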
Layer 3 — Normalized cost data
Normalize GPU cost data across providers into a common schema:
{
  "cluster": "coreweave-us-east",
  "provider": "coreweave",
  "gpu_type": "h100",
  "billing_model": "reserved",
  "cost_per_gpu_hour": 2.49,
  "currency": "USD"
}
Join this with utilization metrics to produce cost efficiency metrics: cost per useful GPU-hour by cluster, provider, and model.
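As a rough illustration of that join, the sketch below divides the billed hourly rate by average utilization. Treating a "useful GPU-hour" as a billed GPU-hour scaled by average utilization is an assumption of this example, not a definition from the cost schema, and the 60% utilization figure is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterCost:
    cluster: str
    provider: str
    cost_per_gpu_hour: float  # USD, from the normalized cost record


def cost_per_useful_gpu_hour(cost: ClusterCost, avg_utilization: float) -> float:
    """Divide the billed hourly rate by the utilized fraction (0 < avg_utilization <= 1)."""
    if not 0.0 < avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be in (0, 1]")
    return cost.cost_per_gpu_hour / avg_utilization


coreweave = ClusterCost("coreweave-us-east", "coreweave", 2.49)
# Hypothetical 60% average utilization: the effective rate is 2.49 / 0.60.
print(round(cost_per_useful_gpu_hour(coreweave, 0.60), 2))
```

Under this definition, a cheap GPU that sits idle can cost more per useful hour than an expensive one that stays busy, which is the comparison the normalized schema is meant to enable.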
Layer 4 — Fleet-level model inventory
Maintain a registry of what's running where. This can be as simple as a Kubernetes ConfigMap updated by your CI/CD pipeline, or as sophisticated as a dedicated model registry.
The minimum viable inventory:
- model: llama-70b
  clusters:
    - name: coreweave-us-east
      replicas: 4
      tier: h100
    - name: aws-us-east-1
      replicas: 2
      tier: a100-80gb
  total_replicas: 6
  last_updated: 2026-05-16T14:00:00Z
---
Fleet-Level Metrics That Matter
Once multi-cluster telemetry is unified, these are the aggregations that surface actionable insights:
| Metric | Query Pattern | Action Threshold |
|---|---|---|
| Fleet GPU utilization | avg(gpu_sm_utilization) by (cluster) | < 45% on any cluster |
| Provider cost efficiency | cost / useful_gpu_hours by provider | 20% worse than fleet avg |
| Tier mismatch rate | count(gpu_sm_utilization < 40%) / count(all) | > 15% of fleet |
| Cross-cluster latency variance | stddev(ttft) by model | High variance = routing problem |
| Model coverage | missing models in inventory | Any gap = shadow deployment |
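To make a few of these aggregations concrete, here is a minimal sketch that computes per-cluster utilization, the tier mismatch rate, and model coverage over in-memory samples. The sample data, the inventory set, and the function names are hypothetical; in practice the inputs would come from the centralized metrics store and the model inventory.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-GPU samples: (cluster, model, SM utilization in percent).
SAMPLES = [
    ("coreweave-us-east", "llama-70b", 72.0),
    ("coreweave-us-east", "llama-70b", 64.0),
    ("aws-us-east-1", "mistral-7b", 35.0),
    ("aws-us-east-1", "mistral-7b", 41.0),
]
INVENTORY_MODELS = {"llama-70b"}  # models recorded in the fleet inventory


def utilization_by_cluster(samples):
    """Average SM utilization per cluster."""
    per_cluster = defaultdict(list)
    for cluster, _model, util in samples:
        per_cluster[cluster].append(util)
    return {cluster: mean(values) for cluster, values in per_cluster.items()}


def tier_mismatch_rate(samples, threshold=40.0):
    """Fraction of GPUs whose utilization is below the threshold."""
    return sum(1 for _c, _m, util in samples if util < threshold) / len(samples)


def uninventoried_models(samples, inventory):
    """Models seen in telemetry but absent from the inventory (possible shadow deployments)."""
    return {model for _c, model, _u in samples} - inventory


by_cluster = utilization_by_cluster(SAMPLES)
low_clusters = [c for c, util in by_cluster.items() if util < 45.0]
print("clusters under 45%:", low_clusters)
print("tier mismatch rate:", tier_mismatch_rate(SAMPLES))
print("not in inventory:", uninventoried_models(SAMPLES, INVENTORY_MODELS))
```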
---
Operational Patterns
Pattern 1 — Unified on-call runbook
A single runbook that works across all clusters, regardless of provider. Operators don't need to know which provider a cluster runs on to diagnose an issue — the telemetry is normalized.
Pattern 2 — Cross-cluster autoscaling
When one cluster is at capacity, route overflow to another. This requires fleet-level visibility into which clusters have headroom — which only works if telemetry is unified.
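The headroom decision can be sketched as a simple selection over fleet-level capacity data. This example assumes free GPU counts are available per cluster and picks the cluster with the most free GPUs that fits the request; the cluster name `onprem-dc1`, the GPU counts, and the selection policy are all illustrative assumptions, and a real router would also weigh cost, latency, and data locality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterCapacity:
    name: str
    total_gpus: int
    busy_gpus: int

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.busy_gpus


def pick_overflow_target(clusters, gpus_needed, exclude):
    """Return the cluster with the most free GPUs that can fit the request, or None."""
    candidates = [
        c for c in clusters if c.name != exclude and c.free_gpus >= gpus_needed
    ]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


# Hypothetical fleet snapshot: the primary cluster is full.
fleet = [
    ClusterCapacity("coreweave-us-east", total_gpus=64, busy_gpus=64),
    ClusterCapacity("aws-us-east-1", total_gpus=16, busy_gpus=10),
    ClusterCapacity("onprem-dc1", total_gpus=8, busy_gpus=7),
]

target = pick_overflow_target(fleet, gpus_needed=4, exclude="coreweave-us-east")
print(target.name if target else "no cluster has headroom")
```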
Pattern 3 — Provider benchmarking
With normalized cost and utilization data, you can measure which provider delivers the most useful GPU-hours per dollar for each workload type. This informs future capacity decisions with data rather than intuition.
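A benchmarking pass might look like the sketch below, which ranks providers by cost per useful GPU-hour and flags any that are more than 20% worse than the fleet average. The cost figures are hypothetical, and using an unweighted mean across providers as the fleet average is a simplifying assumption; weighting by GPU-hours would be a reasonable alternative.

```python
from statistics import mean

# Hypothetical cost per useful GPU-hour (USD) by provider for one workload type.
COST_PER_USEFUL_GPU_HOUR = {"coreweave": 4.15, "aws": 6.80, "onprem": 3.90}


def rank_providers(costs):
    """Providers ordered from cheapest to most expensive per useful GPU-hour."""
    return sorted(costs, key=costs.get)


def flag_inefficient(costs, tolerance=0.20):
    """Providers whose cost exceeds the unweighted fleet average by more than `tolerance`."""
    fleet_avg = mean(costs.values())
    return sorted(p for p, cost in costs.items() if cost > fleet_avg * (1 + tolerance))


print("ranking:", rank_providers(COST_PER_USEFUL_GPU_HOUR))
print("over threshold:", flag_inefficient(COST_PER_USEFUL_GPU_HOUR))
```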
---
The Visibility Baseline
The minimum viable multi-cluster visibility setup:
- Standardized DCGM + inference server metrics on every cluster
- Consistent label schema across all clusters
- Centralized Prometheus or Thanos instance
- Single Grafana fleet dashboard with cluster-level drill-down
- Fleet-level alerts on aggregated signals, not per-cluster noise
This baseline takes 1–2 weeks to build and pays for itself within the first month by surfacing waste patterns that were previously invisible.
---
This concludes the GPU Ops Field Guide — 10 articles covering the core operational challenges of LLM inference infrastructure. [Start from Article #1 →](/blog/gpu-ops-detect-underutilization)