ParallelIQ
AI Infrastructure

The Two Business Models Running AI Inference — And Why They Have Completely Different GPU Problems

By Sam Hosseini · May 31, 2026 · 10 min read
Fireworks, Together, and Replicate sell tokens. Baseten and Modal sell deployments. The same GPU waste looks completely different from each seat — and fixing it requires a completely different pitch.

The inference market looks like one thing from the outside — companies that run AI models on GPUs and charge for it. But underneath that description are two fundamentally different businesses with fundamentally different economics, different customers, and — critically — different GPU problems.

Understanding the difference matters if you're building in this space, investing in it, or selling to it. The same infrastructure failure that shows up as a margin problem for one company shows up as a customer churn problem for the other.

---

Category 1: The Hosted Model API

Fireworks AI, Together AI, Replicate, Groq, Lepton AI, DeepInfra, Cerebras, Cohere, Mistral, AI21 Labs.

These companies have made a bet: they will host models, run the infrastructure, and charge customers per token. You call an API. They handle everything behind the scenes — GPU allocation, routing, scaling, cooling, networking, model loading. You never see a GPU.

Their business model is a spread business. They buy GPU capacity wholesale — reserved instances, spot markets, custom silicon — and sell inference retail, priced per million tokens. The margin is the spread between what they pay per token of compute and what they charge per token of output.

That spread sounds simple. In practice it's under constant pressure from three directions:

Competition compresses the price. The inference API market is intensely competitive. Fireworks, Together, Groq, and a dozen others are all racing to offer the lowest price per token on the same open-source models. Llama 3 70B costs roughly the same to run everywhere — the only way to win on price without losing money is to run it more efficiently than everyone else.

Utilization determines the floor. A GPU that isn't generating tokens is still generating a bill. Unlike a SaaS product where idle servers cost pennies, GPU idle time is expensive. An H100 at $3.50/hour sitting at 0% utilization for four hours is $14 of pure cost with zero revenue attached. Multiply that across a fleet of thousands of GPUs and idle time is an existential margin problem.

Efficiency determines the ceiling. Even a fully utilized GPU can underperform. A model configured to process one request at a time on an H100 might deliver 300 tok/sec. The same model on the same GPU, properly configured, delivers 1,800 tok/sec. That is six times the output and, at per-token pricing, six times the revenue from the same hardware. Cost per token depends not just on which GPU you're running but on how efficiently that GPU is being used.
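The arithmetic behind the utilization floor and the efficiency ceiling can be sketched in a few lines. This is a minimal illustration that uses only the figures quoted above ($3.50/hour, 300 vs. 1,800 tok/sec). The function names are hypothetical, and the model ignores real-world factors such as input tokens, traffic variance, and committed-use discounts.

```python
def idle_cost(hourly_rate_usd: float, idle_hours: float, gpu_count: int = 1) -> float:
    """Dollars billed for GPUs that produced no tokens."""
    return hourly_rate_usd * idle_hours * gpu_count


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """Compute cost of one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


H100_RATE = 3.50  # $/hour, the figure used in the text

print(idle_cost(H100_RATE, idle_hours=4))                  # 14.0
print(round(cost_per_million_tokens(H100_RATE, 300), 2))   # 3.24
print(round(cost_per_million_tokens(H100_RATE, 1800), 2))  # 0.54
```

Under these assumptions the same H100 produces a million tokens for about $3.24 or about $0.54 depending only on configuration, which is the spread a per-token business is actually selling against.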

What the GPU problems look like from inside a Category 1 company

Every inefficiency at a Category 1 company shows up directly on the P&L. No customer sits between the provider and the GPU bill to absorb the cost. When an H100 delivers 300 tok/sec instead of 1,800, that company is paying H100 rates and delivering T4-level output. The margin on that deployment is destroyed.

The specific failure modes:

Tier misplacement: hosting a small 7B model on an H100 because that's what's available, when an L4 would serve the same requests at a fifth of the cost. At scale, across hundreds of models and thousands of requests per second, systematic tier misplacement can represent tens of millions of dollars in avoidable GPU spend.

Dark capacity: nodes allocated and billed but serving zero traffic. This is pure cost with no revenue. In a shared multi-tenant fleet, dark capacity often emerges as a scheduling artifact: workloads shift and demand drops, but allocated nodes don't drain. The billing continues.

Throughput suppression: models running well below their hardware's capability because the serving engine is misconfigured. Typical causes are a low max_num_seqs (the vLLM setting that caps how many sequences are batched together), KV cache pressure, and CPU bottlenecks in feeding the GPU. The GPU looks busy, but the token output rate is far below baseline.

Fragmentation: free GPUs scattered across nodes that can't be assembled into a contiguous block for large model deployments. The fleet has capacity, but the scheduler can't use it. Jobs queue while hardware sits idle.
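Fragmentation is the least intuitive of these four, so here is a small sketch of how it can be measured. It assumes, as a simplification, that a large deployment must fit on a single node; real schedulers may allow multi-node placement. The function name, node names, and GPU counts are hypothetical.

```python
def fragmentation_report(free_gpus_per_node: dict[str, int], gpus_needed: int) -> dict:
    """Summarize whether a job needing `gpus_needed` GPUs on ONE node can be placed."""
    total_free = sum(free_gpus_per_node.values())
    largest_block = max(free_gpus_per_node.values(), default=0)
    placeable = largest_block >= gpus_needed
    return {
        "total_free": total_free,
        "largest_block": largest_block,
        "placeable": placeable,
        # Free GPUs that exist but cannot serve this job anywhere in the fleet.
        "stranded": 0 if placeable else total_free,
    }


# Hypothetical fleet: 8 GPUs are free in total, but no node has 4 free at once.
fleet = {"node-a": 2, "node-b": 3, "node-c": 1, "node-d": 2}
print(fragmentation_report(fleet, gpus_needed=4))
```

In this example the fleet reports eight free GPUs while a four-GPU job still has to queue, which is the gap between nominal capacity and schedulable capacity.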

The core pain for Category 1 is this: standard infrastructure monitoring tells you GPU utilization. It doesn't tell you why a specific model is underperforming, which nodes are dark, or whether your tier allocation matches your workload mix. You're flying partially blind on the metrics that determine your margin.
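To make the contrast with plain utilization monitoring concrete, here is a rough sketch of a model-aware check. Everything in it is an assumption for illustration: the `Deployment` fields, the 50% suppression threshold, and the rule that models of 8B parameters or fewer do not belong on an H100. A real system would derive baselines and tier fit from measured data.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    gpu_tier: str
    model_params_b: float           # billions of parameters
    tokens_per_sec: float           # observed output rate
    baseline_tokens_per_sec: float  # expected rate for this model on this tier
    requests_last_hour: int


def findings(d: Deployment, suppressed_below: float = 0.5,
             small_model_b: float = 8.0) -> list[str]:
    """Return model-aware flags that a bare utilization number would not show."""
    flags = []
    if d.requests_last_hour == 0:
        flags.append("dark capacity")
    else:
        # Throughput is only meaningful when the deployment is serving traffic.
        ratio = d.tokens_per_sec / d.baseline_tokens_per_sec
        if ratio < suppressed_below:
            flags.append(f"throughput suppression ({ratio:.0%} of baseline)")
    if d.gpu_tier == "H100" and d.model_params_b <= small_model_b:
        flags.append("tier misplacement")
    return flags


chat = Deployment("chat-7b", "H100", 7.0, 300.0, 1800.0, requests_last_hour=5000)
print(findings(chat))
# ['throughput suppression (17% of baseline)', 'tier misplacement']
```

The point of the sketch is that each flag needs to know which model is running and what it should deliver, which is information a GPU-level dashboard does not carry.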

---

Category 2: The Deployment Platform

Baseten, Modal, Beam Cloud, Salad Technologies, RunPod.

These companies have made a different bet: they will give customers the infrastructure to run their own models, either on the platform's GPU cloud or the customer's own infrastructure. The model belongs to the customer. The platform provides the deployment tooling, autoscaling, serving infrastructure, and operational layer.

Their business model is a platform business. Revenue comes from compute consumption, seats, or platform fees — not from tokens directly. The customer owns the model; the platform owns the experience of deploying and running it. Winning means customers stay on the platform, scale their usage, and don't churn to AWS or build it themselves.

That model creates a completely different set of pressures:

Reliability is the product. If a customer's model crashes repeatedly, cold-starts unpredictably, or underperforms, that's a platform reliability failure — even if the root cause is the customer's misconfiguration. The platform gets the blame because the platform is supposed to make this easy.

Customer success drives expansion. A customer whose deployment runs efficiently, scales smoothly, and costs predictably will grow their usage and add more models. A customer who can't figure out why their throughput is low or why their GPU bill doubled will churn or stay small.

Differentiation is operational intelligence. The underlying GPU hardware is largely commoditized. What differentiates a deployment platform is how smart it is about helping customers run their models well — right-sized hardware, optimal configuration, predictable scaling. Generic infrastructure is a race to the bottom on price. Intelligent infrastructure is a platform moat.

What the GPU problems look like from inside a Category 2 company

Category 2 companies feel infrastructure problems through their customers. The same failure mode that hits a Category 1 company's P&L hits a Category 2 company's support queue, NPS score, and churn rate.

The specific failure modes:

Cold start latency: a customer's serverless deployment scales to zero between traffic bursts and takes 10 seconds to respond when a new request arrives. The customer files a support ticket. The platform is perceived as slow even though the configuration choice (minReplicas=0) was the customer's.

Suboptimal batch size: a customer's batch job processes inputs one at a time because vLLM's max_num_seqs is set too conservatively. The job takes 6 hours instead of 45 minutes. The customer sees a large compute bill and wonders if the platform is expensive, when the real issue is configuration.

OOM crashes: a customer deploys a model slightly too large for their chosen GPU tier. Under load, memory pressure builds and the container crashes. Repeated OOM events look like platform instability.

Thrashing: autoscaling thresholds are too aggressive. The deployment scales down during a quiet period, then immediately has to scale back up when traffic resumes, incurring repeated cold starts. The customer sees unpredictable latency.

Cost opacity: the customer can't tell which deployment is responsible for which portion of their bill, or what configuration changes would reduce costs. They feel out of control.
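Cost opacity in particular is mostly a bookkeeping problem. The sketch below attributes a bill to individual deployments from usage records. The record shape, deployment names, and hours are hypothetical; the H100 rate is the $3.50/hour used earlier, and the L4 rate assumes the "fifth of the cost" figure rather than any real price list.

```python
from collections import defaultdict

# $/GPU-hour. Illustrative values only (see the assumptions above).
HOURLY_RATES = {"H100": 3.50, "L4": 0.70}


def bill_by_deployment(usage_records, hourly_rates=HOURLY_RATES):
    """Return [(deployment, cost_usd, share_of_bill)], largest cost first."""
    totals = defaultdict(float)
    for deployment, gpu_tier, gpu_hours in usage_records:
        totals[deployment] += gpu_hours * hourly_rates[gpu_tier]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, cost, cost / grand_total) for name, cost in ranked]


# Hypothetical usage records: (deployment, GPU tier, GPU-hours).
records = [
    ("summarizer", "H100", 100.0),
    ("embedder", "L4", 200.0),
    ("summarizer", "H100", 20.0),
]
for name, cost, share in bill_by_deployment(records):
    print(f"{name}: ${cost:.2f} ({share:.0%} of bill)")
```

With these made-up records the summarizer accounts for three quarters of the bill, which is the kind of answer a customer needs before any configuration advice becomes actionable.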

The core pain for Category 2 is this: when a customer's deployment underperforms, the platform gets blamed — even when the root cause is a configuration problem the customer created. Without model-aware intelligence built into the platform, there's no way to proactively identify and fix these issues before they become support tickets and churn.
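One way to picture proactive, model-aware checks is a lint pass that runs before a deployment goes live. The sketch below is a set of rough heuristics, not a description of any platform's actual logic: the config fields, the 20% memory headroom reserved for KV cache and activations, and the rule comparing scale-down delay with cold start time are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DeploymentConfig:
    model_params_b: float      # billions of parameters
    bytes_per_param: int       # e.g. 2 for fp16/bf16 weights
    gpu_memory_gb: float
    min_replicas: int
    scale_down_delay_s: float  # idle time before a replica is removed
    cold_start_s: float        # time to bring a replica back


def lint(cfg: DeploymentConfig, headroom: float = 0.2) -> list[str]:
    """Return warnings for likely cold start, OOM, and thrashing problems."""
    warnings = []
    if cfg.min_replicas == 0:
        warnings.append(
            f"cold start: first request after idle waits ~{cfg.cold_start_s:.0f}s")
    # 1e9 parameters * bytes per parameter is roughly gigabytes of weights.
    weights_gb = cfg.model_params_b * cfg.bytes_per_param
    usable_gb = cfg.gpu_memory_gb * (1 - headroom)
    if weights_gb > usable_gb:
        warnings.append(
            f"OOM risk: ~{weights_gb:.0f} GB of weights vs {usable_gb:.1f} GB usable")
    if cfg.scale_down_delay_s < cfg.cold_start_s:
        warnings.append("thrashing risk: scale-down delay is shorter than a cold start")
    return warnings


# Hypothetical risky config: a 13B fp16 model on a 24 GB GPU, scaling to zero.
risky = DeploymentConfig(13.0, 2, 24.0, min_replicas=0,
                         scale_down_delay_s=5.0, cold_start_s=10.0)
for warning in lint(risky):
    print(warning)
```

Each warning corresponds to a support ticket described above, surfaced at deploy time instead of after the customer has already seen the failure.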

---

The Same Waste, Two Different Problems

Here's the thing: many of the same GPU inefficiencies affect both categories. Tier misplacement, overprovisioning, thrashing, low throughput — these happen on both sides. But the consequence is completely different.

| Failure mode | Category 1 consequence | Category 2 consequence |
| --- | --- | --- |
| Tier misplacement | Margin destroyed on that deployment | Customer's deployment costs more than it should |
| Low throughput | Cost per token rises, revenue per GPU falls | Customer complains model is slow, blames platform |
| Cold start / thrashing | Scheduling overhead on shared fleet | Customer SLA breach, support ticket, churn risk |
| Dark capacity | Direct revenue loss | Less relevant: customer pays for what they use |
| OOM risk | Customer-facing outage | Platform reliability failure, churn risk |
| Suboptimal batch size | Less relevant: they configure serving themselves | Customer's bill is higher than it should be |

Category 1 companies feel waste on their income statement. They own the GPU, they own the model, they own the margin, so every inefficiency is their problem, immediately, in dollars.

Category 2 companies feel waste through their customers. The GPU cost might pass through, but the real consequence is customer experience: latency, crashes, unpredictable bills, and the support load that follows.

---

What Good Looks Like for Each

For a Category 1 company, good looks like model-aware fleet management. Knowing that model X belongs on an L4, not an H100. Knowing that deployment Y is running at 18% of baseline throughput. Knowing that node Z has been dark for 14 hours. Translating all of that into a dollar figure that maps directly to margin recovery.

The pitch is: we show you where your margin is leaking at the model and deployment level, not just the GPU level.

For a Category 2 company, good looks like platform intelligence that makes customers more successful. When a customer deploys a model, the platform should tell them they've chosen the wrong GPU tier, their batch size is configured suboptimally, or their autoscaling thresholds will cause thrashing. Before the support ticket. Before the churn conversation.

The pitch is: we give your platform a model-aware optimization layer so your customers run better deployments — and the platform gets the credit.

---

Why This Matters Now

Both categories are under pressure. Category 1 is in a price war — the only durable path to margin is efficiency, not pricing power. Category 2 is in a differentiation war — generic GPU clouds are commoditizing, and intelligence is the only moat.

In both cases, the answer runs through the same insight: GPU utilization metrics are not enough. Knowing that a GPU is at 72% utilization tells you almost nothing about whether that GPU is generating appropriate revenue (Category 1) or delivering appropriate performance (Category 2). What you need is model-aware observability — understanding the relationship between the model running on the hardware and whether that hardware is the right fit, correctly configured, and properly utilized.
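A short worked example shows why the utilization number alone is uninformative. The two GPUs below report the same 72% utilization but deliver the 300 and 1,800 tok/sec figures used earlier. The $0.90 per million tokens selling price is a hypothetical value chosen for illustration, and the function name is likewise an assumption.

```python
def margin_per_gpu_hour(tokens_per_sec: float, price_per_million_usd: float,
                        gpu_hourly_usd: float) -> float:
    """Revenue from one hour of output minus the hourly GPU cost."""
    revenue = tokens_per_sec * 3600 / 1_000_000 * price_per_million_usd
    return revenue - gpu_hourly_usd


PRICE_PER_MILLION = 0.90  # hypothetical selling price, $/1M output tokens
GPU_RATE = 3.50           # $/hour, the H100 figure used in the text

gpus = [
    {"name": "gpu-a", "utilization": 0.72, "tokens_per_sec": 300},
    {"name": "gpu-b", "utilization": 0.72, "tokens_per_sec": 1800},
]
for gpu in gpus:
    margin = margin_per_gpu_hour(gpu["tokens_per_sec"], PRICE_PER_MILLION, GPU_RATE)
    print(f"{gpu['name']}: util={gpu['utilization']:.0%} margin=${margin:+.2f}/GPU-hour")
# gpu-a: util=72% margin=$-2.53/GPU-hour
# gpu-b: util=72% margin=$+2.33/GPU-hour
```

Under these assumed prices, identical utilization readings correspond to one GPU losing money every hour and one earning it, and only a metric tied to the model's output can tell them apart.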

That's the gap both categories are sitting on. And it's larger than most people in the infrastructure space have yet recognized.

---

The Bottom Line

There are two businesses running AI inference. One sells tokens and lives or dies on margin per token. The other sells deployments and lives or dies on customer success.

Same GPUs. Same models. Completely different problems.

The companies that figure out model-aware GPU optimization first — in either category — will have a durable cost or quality advantage that's very hard for competitors to replicate without building the same instrumentation layer from scratch.

Start with piqc — the open-source GPU waste scanner — or reach out to discuss how the full optimization layer maps to your specific business model.
