AI Infrastructure

10 GPU Fleet Findings — And Who Each One Matters To

By Sam Hosseini · May 31, 2026 · 12 min read

Not every GPU fleet problem looks the same from every seat. Here are the ten failure modes Paralleliq detects, what each one means, and why platform teams, GPUaaS providers, inference providers, and liquidity markets each care about different ones.

GPU infrastructure problems don't announce themselves. They accumulate quietly — in cloud bills that climb without obvious cause, in models that run slower than they should, in jobs that sit pending while hardware sits idle. Most teams discover them weeks later, if at all.

Paralleliq scans GPU infrastructure continuously and surfaces ten categories of waste, inefficiency, and risk. But here's what most discussions of GPU optimization miss: not every finding matters to every operator. A platform team running their own inference cluster has different priorities than a GPUaaS provider managing a shared fleet — and an inference provider like Baseten or Fireworks has different concerns than a liquidity market trading spare GPU capacity.

This post explains all ten findings and maps each one to the operators who should care most.

---

The Ten Findings

1. Tier Misplacement

A model is running on a GPU that doesn't match its memory requirements. This cuts both ways: the model might be too large for the GPU (causing OOM crashes or performance degradation from memory swapping), or the GPU might be far more powerful than the model needs (wasted spend at a higher tier than the workload justifies).

Tier misplacement is the most universal finding. Every operator, regardless of business model, loses money or reliability when models land on the wrong hardware. It's also one of the hardest problems to catch manually — GPU utilization metrics look normal, but the underlying mismatch between model requirements and hardware tier is invisible without model-aware instrumentation.
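To make "model-aware" concrete, here is a minimal sketch of a placement check. It estimates weight memory from parameter count and compares it with GPU memory. The function names, the 20% headroom, and the 25% low-fill threshold are assumptions chosen for illustration, not Paralleliq's actual logic, and the estimate ignores KV cache and activation memory.

```python
def weight_footprint_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB: billions of params times bytes per param."""
    return params_billions * bytes_per_param


def classify_placement(params_billions, gpu_vram_gb, bytes_per_param=2.0,
                       headroom=0.20, low_fill=0.25):
    """Classify a model/GPU pairing as 'too_small', 'too_large', or 'ok'.

    headroom: fraction of VRAM held back for KV cache and activations (assumed).
    low_fill: weights below this fraction of VRAM suggest an oversized tier (assumed).
    """
    weights = weight_footprint_gb(params_billions, bytes_per_param)
    usable = gpu_vram_gb * (1.0 - headroom)
    if weights > usable:
        return "too_small"   # GPU cannot hold the model with safe headroom
    if weights < gpu_vram_gb * low_fill:
        return "too_large"   # GPU is far bigger than the model needs
    return "ok"


print(classify_placement(70, 24))  # 70B in FP16 on a 24 GB card
print(classify_placement(7, 80))   # 7B in FP16 on an 80 GB card
```

Neither case would stand out on a utilization dashboard, which is why the check needs the model's size as an input.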

Who it matters to: Everyone. Platform teams overpay. GPUaaS providers misallocate inventory. Inference providers compress their own margin. Liquidity markets see mismatched capacity that could be rebalanced or swapped.

---

2. GPU Overprovisioning

The GPU has significantly more memory or compute than the workload actually uses. The model fits, runs, and appears healthy — but a cheaper GPU tier would deliver the same output at a fraction of the cost.

This is distinct from tier misplacement. Misplacement is the wrong tier for the model's requirements. Overprovisioning is a tier that satisfies the model's requirements but offers far more than the workload actually saturates. A small 7B model on an H100 might technically run fine, but it's using 8% of available VRAM. An L4 would do the same job.
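One way to reason about this is to ask which tier is the cheapest that still covers observed peak memory. The sketch below does that against a hypothetical catalog. The VRAM sizes are public specs, but the hourly prices are placeholders, and the 80% safety margin is an assumption. It considers memory only, so a real check would also need to confirm the smaller tier meets latency and throughput targets.

```python
# Hypothetical tier catalog: prices are placeholders, not real quotes.
TIERS = [
    {"name": "L4", "vram_gb": 24, "usd_per_hour": 1.0},
    {"name": "A100-40GB", "vram_gb": 40, "usd_per_hour": 3.0},
    {"name": "H100", "vram_gb": 80, "usd_per_hour": 5.0},
]


def cheapest_fitting_tier(peak_mem_gb, tiers=TIERS, safety_margin=0.8):
    """Return the cheapest tier whose VRAM * safety_margin covers the observed peak.

    Returns None when no tier in the catalog is large enough.
    """
    candidates = [t for t in tiers if peak_mem_gb <= t["vram_gb"] * safety_margin]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t["usd_per_hour"])


# A workload peaking at 8% of an 80 GB card uses about 6.4 GB.
peak = 0.08 * 80
print(cheapest_fitting_tier(peak)["name"])
```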

Overprovisioning is common in teams that provisioned conservatively during initial deployment and never revisited. The model was once under development; the larger GPU made sense for experimentation. It went to production and the hardware stayed.

A particularly powerful variant: when overprovisioning is detected on a multi-tenant API endpoint, fixing it doesn't just improve margins for one workload — it improves provider margin across every tenant simultaneously. One change, fleet-wide impact.

Who it matters to: Platform teams (direct cost savings), GPUaaS providers and inference providers (margin recovery), liquidity markets (rebalanceable inventory).

---

3. OOM Risk

Memory utilization is critically high and OOM kill events have been observed. The workload is actively crashing under load.

This is a reliability finding, not just a cost finding. Unlike overprovisioning, where the hardware has too much capacity, OOM risk means the hardware has too little. The model is too large for the GPU tier it's on, memory pressure is building, and the container is being killed when utilization spikes. Every OOM event is a production incident — user-facing downtime, dropped requests, and SLA violations.

The combination of high memory utilization and observed OOM kills is important. High utilization alone might be acceptable. Observed kills confirm the system has already crossed the line.
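The two-signal rule can be sketched as a small classifier. The 95% threshold and the three labels are assumptions for illustration; the point is only that utilization and observed kills are evaluated together.

```python
def oom_risk(mem_util_samples, oom_kill_count, util_threshold=0.95):
    """Combine peak memory utilization with observed OOM kills.

    Returns 'confirmed' when both signals are present, 'watch' when only
    one is, and 'none' otherwise. Threshold and labels are illustrative.
    """
    peak = max(mem_util_samples, default=0.0)
    high = peak >= util_threshold
    killed = oom_kill_count > 0
    if high and killed:
        return "confirmed"
    if high or killed:
        return "watch"
    return "none"


print(oom_risk([0.91, 0.97, 0.99], oom_kill_count=3))
print(oom_risk([0.96, 0.97], oom_kill_count=0))
```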

Who it matters to: Platform teams (operational risk), inference providers (SLA violations). GPUaaS providers and liquidity markets don't directly absorb the blast radius.

---

4. Cold Start Latency

A serverless deployment is configured with minReplicas=0, meaning it scales completely to zero when idle. When a request arrives, the GPU must be allocated and the model loaded before the first token can be served — often taking 8–12 seconds for a large model.

For user-facing applications, this breaks latency SLOs. A chatbot that takes 12 seconds before it starts responding is not a viable product, even if token generation is fast once it starts. The cold start problem is architectural: scale-to-zero is economically attractive because it eliminates always-on GPU costs, but it trades those savings for unpredictable latency spikes that surface directly to users.

The remediation is straightforward — set minReplicas ≥ 1 to keep at least one warm replica at all times — but it requires an explicit decision to accept the always-on cost in exchange for latency predictability.
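A check for this finding can be as simple as comparing each deployment's scaling floor and measured cold start against its first-token SLO. The sketch below assumes deployments are described as plain dictionaries. The field names other than minReplicas, and the 2-second SLO, are hypothetical.

```python
def cold_start_findings(deployments, first_token_slo_s=2.0):
    """Return names of deployments that scale to zero and whose measured
    cold start exceeds the first-token SLO.

    Each deployment is a dict with 'name', 'minReplicas', and
    'cold_start_s' (a measured value; the field name is hypothetical).
    """
    flagged = []
    for d in deployments:
        if d["minReplicas"] == 0 and d["cold_start_s"] > first_token_slo_s:
            flagged.append(d["name"])
    return flagged


fleet = [
    {"name": "chat-prod", "minReplicas": 0, "cold_start_s": 11.5},
    {"name": "chat-warm", "minReplicas": 1, "cold_start_s": 11.5},
    {"name": "nightly-embed", "minReplicas": 0, "cold_start_s": 1.2},
]
print(cold_start_findings(fleet))
```

Only the first deployment is flagged: the second keeps a warm replica, and the third scales to zero but loads fast enough to stay inside the SLO.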

Who it matters to: Platform teams (user experience), inference providers (SLA compliance and customer retention). This is an operational concern, not a fleet economics concern.

---

5. Scale-to-Zero Thrashing

The deployment is scaling up and down repeatedly in short cycles, triggering repeated cold starts and GPU allocation overhead. Unlike a single cold start event, thrashing creates a pattern: the model scales to zero during a quiet period, a burst of requests arrives, it scales back up (incurring another cold start), traffic subsides, it scales down again.

This is often caused by misconfigured autoscaling thresholds. The scale-down grace period is too short, so the system interprets normal inter-request variability as idle time and aggressively scales down. Each cycle has a cost: GPU allocation latency, model loading time, and the wasted effort of spinning resources up and down repeatedly.
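Thrashing shows up as a pattern in the replica count over time, so one way to detect it is to count how often a deployment comes back up from zero within a window. This is a sketch under assumed inputs: a time-ordered list of (timestamp in seconds, replica count) samples, a one-hour window, and a limit of three cycles, all of which are illustrative.

```python
def count_scale_ups_from_zero(replica_timeline):
    """Count transitions from 0 replicas to more than 0.

    replica_timeline: time-ordered list of (timestamp_s, replicas).
    """
    count = 0
    prev = None
    for _, replicas in replica_timeline:
        if prev == 0 and replicas > 0:
            count += 1
        prev = replicas
    return count


def is_thrashing(replica_timeline, window_s=3600, max_cycles=3):
    """True when the most recent window contains more than max_cycles
    scale-ups from zero. Window and limit are assumed values."""
    if not replica_timeline:
        return False
    end = replica_timeline[-1][0]
    recent = [(t, r) for t, r in replica_timeline if t >= end - window_s]
    return count_scale_ups_from_zero(recent) > max_cycles


bursty = [(0, 0), (300, 1), (600, 0), (900, 1), (1200, 0),
          (1500, 1), (1800, 0), (2100, 1)]
print(is_thrashing(bursty))
```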

Who it matters to: Platform teams and inference providers. The fix is configuration tuning, not hardware changes.

---

6. Dark Capacity

A GPU node is allocated, metered, and billed — but serving zero active traffic. No workloads are running on it. Nothing is scheduled. The cost clock is running and no value is being delivered.

Dark capacity is silent budget burn. It doesn't surface as a workload problem because there is no workload. It doesn't appear in utilization dashboards as high or low — it appears as zero, which is easy to overlook. GPU billing is continuous regardless of whether the hardware is serving requests, so a node sitting dark for a week at H100 rates is a significant unrecovered cost.

The finding requires detecting not just low utilization but true idleness — zero active traffic, confirmed allocation, no pending workloads. A node with low utilization might be legitimately underloaded. A node with zero traffic and no active requests is dark capacity.
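The distinction between "underloaded" and "dark" can be written as a predicate. The sketch below uses a hypothetical NodeSnapshot record; the field names are assumptions about what a scanner might collect, and the hourly rate in the example is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-node record; field names are illustrative."""
    name: str
    allocated: bool            # provisioned and being billed
    running_workloads: int
    pending_workloads: int
    requests_last_window: int  # requests served in the observation window


def is_dark(node: NodeSnapshot) -> bool:
    """Dark means billed, with nothing running, nothing pending, no traffic."""
    return (node.allocated
            and node.running_workloads == 0
            and node.pending_workloads == 0
            and node.requests_last_window == 0)


def dark_cost(hours: float, usd_per_hour: float) -> float:
    """Spend accrued by a node while it sat dark."""
    return hours * usd_per_hour


idle = NodeSnapshot("gpu-node-7", True, 0, 0, 0)
quiet = NodeSnapshot("gpu-node-8", True, 1, 0, 12)
print(is_dark(idle), is_dark(quiet))
print(dark_cost(hours=7 * 24, usd_per_hour=4.0))  # one week at a placeholder rate
```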

Who it matters to: GPUaaS providers (direct margin loss), inference providers (cost recovery), liquidity markets (this is tradeable idle supply).

---

7. Batch Over-Tiered

A batch processing job is running on high-end GPU hardware — typically an H100 — at low compute utilization. The job completes, but the hardware is doing far less work than it's capable of, and the cost of that hardware is billed for the full duration.

Batch jobs that perform classification, embedding generation, or document processing rarely need the memory bandwidth of a top-tier GPU. They need enough VRAM to hold the model and enough compute to process inputs sequentially. An A10G or L4 would complete the same job at a third of the cost, often with comparable duration.
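A rough sketch of how this finding could be flagged and priced follows. The set of tiers treated as high-end, the 30% utilization threshold, the field names, and the hourly rates are all assumptions for illustration.

```python
HIGH_END_GPUS = {"H100", "H200", "A100"}  # assumed definition of "high-end"


def batch_over_tiered(job, util_threshold=0.30):
    """Flag batch jobs on high-end GPUs with low average compute utilization.

    job: dict with 'kind', 'gpu', and 'avg_compute_util' (hypothetical fields).
    """
    return (job["kind"] == "batch"
            and job["gpu"] in HIGH_END_GPUS
            and job["avg_compute_util"] < util_threshold)


def job_cost(hours, usd_per_hour):
    """A job is billed for its full duration at the tier's hourly rate."""
    return hours * usd_per_hour


nightly = {"kind": "batch", "gpu": "H100", "avg_compute_util": 0.12}
print(batch_over_tiered(nightly))

# Placeholder rates: a slightly longer run on a cheaper tier can still cost far less.
print(job_cost(2.0, 6.0), job_cost(2.2, 1.8))
```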

The pattern is common in teams that run batch jobs on the same infrastructure as their real-time inference workloads. The scheduling is convenient, the hardware is available — but the economics don't hold up when the job runs nightly and the H100 rate is billed throughout.

MIG (Multi-Instance GPU) is worth considering here: rather than rescheduling the batch job to different hardware, MIG can partition the H100 into smaller instances, letting the batch job run on one slice while other workloads use the rest. This is particularly useful when the batch job runs on an H100 because that's what's available, not because that's what it needs.

Who it matters to: Platform teams (job cost reduction), GPUaaS providers (utilization optimization), liquidity markets (rebalanceable capacity).

---

8. Batch Suboptimal Batch Size

A batch job is processing inputs one at a time — batch size 1 — when available GPU memory could support processing 16, 32, or 64 inputs per step. The GPU is doing sequential work when it could be doing parallel work, leaving the majority of its throughput on the table.

This is a pure configuration problem. The hardware is right, the model is right, the job is structured correctly — but the serving engine is configured to process one input at a time. Every GPU step that processes a single input instead of a full batch is a step where 95%+ of available compute is idle.

The impact is dramatic. A job that takes 4 hours at batch size 1 might complete in 30 minutes at batch size 32, on the same hardware and at the same hourly rate, so the bill for the run shrinks along with the runtime. The savings come not from cheaper hardware but from using the existing hardware correctly.
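The arithmetic behind that kind of speedup can be sketched as follows. The input count and per-step times are hypothetical values chosen to reproduce the 4-hour and 30-minute figures. A larger batch makes each step slower, but not 32 times slower, which is where the gain comes from. The hourly rate is a placeholder.

```python
import math


def job_hours(num_inputs, batch_size, seconds_per_step):
    """Wall-clock hours for a job that runs ceil(num_inputs / batch_size) steps."""
    steps = math.ceil(num_inputs / batch_size)
    return steps * seconds_per_step / 3600


NUM_INPUTS = 14_400   # hypothetical job size
RATE = 4.0            # placeholder hourly rate in USD

# Assumed step times: 1 s per single input, 4 s per batch of 32.
slow = job_hours(NUM_INPUTS, batch_size=1, seconds_per_step=1.0)
fast = job_hours(NUM_INPUTS, batch_size=32, seconds_per_step=4.0)

print(slow, fast)                 # hours at each batch size
print(slow * RATE, fast * RATE)   # same hourly rate, so the bill tracks runtime
```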

Who it matters to: Platform teams directly — this is an internal configuration problem. The fix requires no infrastructure change.

---

9. Fragmentation

A training or fine-tuning job is sitting in the scheduler queue, waiting. It has requested 8 GPUs for tensor-parallel execution. The cluster has 8 free GPUs available. But those 8 GPUs are split 4+4 across two different nodes, and tensor parallelism in practice needs all of its GPUs on the same node, connected over NVLink for high-bandwidth inter-GPU communication.

The job cannot schedule. The cluster has enough capacity in aggregate but not in the right topology. The 8 requested GPUs could be used — but only as a contiguous block on a single node. Fragmented across nodes, they're effectively unavailable for this workload. Both the job and the GPUs sit idle.
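The difference between a shortage and fragmentation fits in a few lines. This sketch assumes the scheduler state is available as a mapping from node name to free GPU count and that the job must land on a single node; both function names are hypothetical.

```python
def can_schedule_single_node(free_gpus_by_node, gpus_needed):
    """True when at least one node has enough free GPUs on its own."""
    return any(free >= gpus_needed for free in free_gpus_by_node.values())


def is_fragmented(free_gpus_by_node, gpus_needed):
    """True when the cluster has enough free GPUs in aggregate
    but no single node can host the whole request."""
    total_free = sum(free_gpus_by_node.values())
    return (total_free >= gpus_needed
            and not can_schedule_single_node(free_gpus_by_node, gpus_needed))


print(is_fragmented({"node-a": 4, "node-b": 4}, gpus_needed=8))  # split 4+4
print(is_fragmented({"node-a": 8, "node-b": 0}, gpus_needed=8))  # contiguous block
print(is_fragmented({"node-a": 2, "node-b": 2}, gpus_needed=8))  # a real shortage
```

Only the first case is fragmentation. The third is a genuine capacity shortage and calls for a different remedy.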

This is why fragmentation matters: it's not a shortage. The hardware exists. The problem is topology — the result of incremental workload scheduling over time that leaves free capacity scattered rather than consolidated.

Fragmentation is most relevant for large model training and inference workloads that use tensor parallelism. Data-parallel workloads don't need contiguous allocation — each replica gets its own GPU — so fragmentation doesn't affect them. But any workload that needs multiple GPUs in tight NVLink communication is at risk.

It's also worth noting what doesn't fix fragmentation: adding more GPUs. More GPUs means more potential for fragmented free space. The fix is consolidation — draining and reshuffling workloads to open a contiguous block.

Who it matters to: GPUaaS providers (scheduling efficiency), inference providers (fleet scheduling), liquidity markets (stranded capacity identification).

---

10. Low Throughput

The GPU is the right tier. It's not idle. The model is the right size. But the serving engine is delivering far fewer tokens per second than the hardware should support.

An H100 running an 8B model should deliver roughly 1,800 tok/sec with a well-configured vLLM deployment. If it's delivering 340 tok/sec, something is wrong — but the problem isn't the hardware. It's the configuration. The GPU appears healthy by every standard utilization metric, but the actual output rate is suppressed.

The most common causes: max_num_seqs is set too low, starving the GPU of concurrent requests to batch together; KV cache utilization is high, forcing the serving engine to recompute attention states instead of retrieving them from cache; or the CPU is saturated, preventing the serving engine from preprocessing requests fast enough to keep the GPU fed.
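A first-pass triage of those causes could look like the sketch below. It compares observed throughput with an expected baseline and then checks the three signals named above. Every threshold here is an assumption for illustration, and the returned labels are hypothetical, not output from any real tool.

```python
def throughput_ratio(observed_tok_s, expected_tok_s):
    """Fraction of the expected token rate that is actually delivered."""
    return observed_tok_s / expected_tok_s


def likely_causes(max_num_seqs, kv_cache_util, cpu_util,
                  min_seqs=64, kv_limit=0.95, cpu_limit=0.90):
    """Return labels for signals that cross assumed thresholds.

    max_num_seqs: configured cap on concurrently scheduled sequences.
    kv_cache_util, cpu_util: utilization as fractions between 0 and 1.
    """
    causes = []
    if max_num_seqs < min_seqs:
        causes.append("max_num_seqs_low")
    if kv_cache_util >= kv_limit:
        causes.append("kv_cache_pressure")
    if cpu_util >= cpu_limit:
        causes.append("cpu_saturated")
    return causes


ratio = throughput_ratio(340, 1800)
print(round(ratio, 2))
print(likely_causes(max_num_seqs=8, kv_cache_util=0.50, cpu_util=0.40))
```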

Low throughput is different in character from the other findings. Most GPU findings involve the wrong resource being used (misplacement, overprovisioning) or the right resource sitting idle (dark capacity, fragmentation). Low throughput means the right resource is active but underperforming — a harder signal to detect, and a more sophisticated finding as a result.

For inference providers, the economics are direct: throughput is the denominator in cost-per-token. A 5x throughput improvement on the same hardware cuts cost per token to one fifth, and that saving flows straight to gross margin on that deployment. No procurement, no rebalancing, just configuration tuning.
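To see why throughput sits in the denominator, here is the cost-per-token calculation as a short sketch. The hourly rate is a placeholder; the two throughput figures are the ones used in the example above.

```python
def cost_per_million_tokens(usd_per_hour, tokens_per_second):
    """GPU cost to generate one million tokens at a sustained token rate."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


RATE = 4.0  # placeholder hourly rate in USD

before = cost_per_million_tokens(RATE, 340)
after = cost_per_million_tokens(RATE, 1800)

print(round(before, 2), round(after, 2))
print(round(before / after, 2))  # equals 1800 / 340, independent of the rate
```

The ratio between the two costs depends only on the throughput ratio, so the conclusion holds whatever the real hourly rate is.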

MIG is not the right remediation for low throughput. Smaller GPU slices won't fix a misconfigured serving engine — they'll give the misconfigured engine less memory to work with. The fix is serving engine configuration: batch sizes, cache settings, and CPU capacity.

Who it matters to: Platform teams (model performance), inference providers (margin per token — highest priority finding for this segment).

---

Mapping Findings to Operators

Different operators see different problems as existential versus operational. The table below maps each finding to the buyer personas who should care most.

Finding	Platform Team	GPUaaS Provider	Inference Provider	Liquidity Market
Tier misplacement	✓	✓	✓	✓
GPU overprovisioning	✓	✓	✓	✓
OOM risk	✓	-	✓	-
Cold start latency	✓	-	✓	-
Scale-to-zero thrashing	✓	-	✓	-
Dark capacity	-	✓	✓	✓
Batch over-tiered	✓	✓	-	✓
Batch suboptimal batch size	✓	-	-	-
Fragmentation	-	✓	✓	✓
Low throughput	✓	-	✓	-

A few patterns stand out.

Tier misplacement and overprovisioning are universal: every operator loses when models land on the wrong hardware or on more hardware than they need. These are the only two findings that matter to all four segments.

Platform teams care most about operational reliability — OOM risk, cold starts, thrashing, and batch misconfiguration are all internal problems that affect the teams running the workloads. These findings don't surface as fleet economics problems; they surface as incidents, latency spikes, and missed batch windows.

GPUaaS providers care most about fleet economics — dark capacity, fragmentation, and tier misplacement directly hit utilization rates and margin. These are the findings that determine whether a GPU cloud is profitable at scale.

Inference providers are the most sensitive overall — they own both the infrastructure cost and the customer SLA simultaneously. Tier misplacement and overprovisioning compress their margin. OOM and cold starts break their SLAs. Dark capacity is revenue loss. Low throughput is the most directly tied to their unit economics, since their business is priced per token.

Liquidity markets care about moveable capacity — dark capacity, fragmentation, and misplacement identify GPU resources that are either idle, stranded, or mismatched. These are the signals that a market could act on: idle GPUs that could be resold, fragmented capacity that could be consolidated, mismatched allocations that could be swapped.

---

What This Means for Buyers

If you're a platform team at a Series B/C AI company: the findings that will save you the most money the fastest are tier misplacement and overprovisioning. The findings that will prevent the most incidents are OOM risk and batch misconfiguration.

If you're a GPUaaS provider: start with dark capacity detection and fragmentation. These are directly tied to utilization rates and determine whether your fleet economics work at scale. Tier misplacement matters too — it represents inventory that could be serving higher-value workloads.

If you're an inference provider: throughput and tier misplacement are your highest-leverage findings. Throughput because it's directly tied to cost-per-token — your unit economics. Misplacement because every mismatch between model requirements and GPU tier compresses margin on that deployment.

If you're building or operating a GPU liquidity market: dark capacity is your primary signal. It identifies supply that exists but isn't being used and could be rebalanced or resold. Fragmentation surfaces stranded supply that a scheduling-aware market could unlock.

---

The ten findings are not equally important to everyone. But every GPU fleet, at scale, eventually encounters most of them. The question isn't whether these problems will appear — it's whether your infrastructure can surface them before they become visible on a bill or in a production incident.

That's what model-aware GPU fleet optimization is for.

---

The Bottom Line

Ten findings. Four buyer types. One common thread: GPU waste is invisible without instrumentation.

Platform teams lose money to misplacement and overprovisioning they can't see. Inference providers compress their own margin on misconfigured serving engines. GPUaaS providers watch dark capacity and fragmentation silently erode utilization rates. Liquidity markets can't move capacity they can't identify.

The first step is making the invisible visible — knowing which models are on the wrong hardware, which nodes are billing without serving, and where throughput is leaking before it shows up on a bill or in a production incident.

Start with piqc — the open-source GPU waste scanner — or reach out to discuss the full optimization layer for your fleet.