Free Tool

Inference Capacity Planner.

How many GPUs do you actually need? Enter your model, peak traffic, and serving engine to get a replica count, an annual cost, and an API vs self-host comparison.

Model Size

GPU Type

Serving Engine

Cloud Provider

Requests / sec (peak)

Avg output tokens

Capacity Headroom

Recommended Fleet

6× replicas

H100 80GB · 2 GPUs/replica

Total GPUs: 12

Capacity: 4.2K tok/s

Annual cost: $538,214

Cost / M tokens: $6.67

Traffic scaling scenarios

Scenario	Replicas	GPUs	Annual cost
Current (baseline)	6	12	$538,214
2× traffic spike	11	22	$986,726
5× growth	26	52	$2.33M
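
The scenario rows follow from straightforward arithmetic, sketched here in Python. This is an illustration, not the tool's actual implementation, and every input is an assumption inferred from figures on this page: 2,560 tok/s of sustained demand (implied by the API comparison costs), 700 tok/s per replica (4.2K tok/s across 6 replicas), $5.12 per GPU-hour ($538,214 across 12 GPUs and 8,760 hours), and a 40% headroom that happens to reproduce the replica counts. The function and parameter names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def plan_fleet(demand_tok_s, replica_tok_s, gpus_per_replica,
               gpu_hour_usd, headroom=0.4):
    """Return (replicas, gpus, annual_cost_usd) for a sustained token demand.

    All arguments are illustrative assumptions, not values published by the tool.
    """
    # Provision for demand plus headroom, then round up to whole replicas.
    required_tok_s = demand_tok_s * (1 + headroom)
    replicas = math.ceil(required_tok_s / replica_tok_s)
    gpus = replicas * gpus_per_replica
    # Assumes GPUs are billed around the clock for the whole year.
    annual_cost = gpus * gpu_hour_usd * HOURS_PER_YEAR
    return replicas, gpus, annual_cost


if __name__ == "__main__":
    for label, multiplier in [("baseline", 1), ("2x spike", 2), ("5x growth", 5)]:
        replicas, gpus, annual = plan_fleet(
            demand_tok_s=2560 * multiplier,  # assumed demand, scaled per scenario
            replica_tok_s=700,               # assumed per-replica throughput
            gpus_per_replica=2,
            gpu_hour_usd=5.12,               # assumed hourly GPU price
        )
        print(f"{label}: {replicas} replicas, {gpus} GPUs, ${annual:,.0f}/yr")
```

Under these assumptions the sketch reproduces the 6, 11, and 26 replica counts in the table. Replica counts grow slightly slower than traffic because rounding up to a whole replica wastes proportionally less capacity at larger fleet sizes.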

Self-host vs serverless API

Provider	$/M tok	Annual cost	Savings vs self-host
Together.ai	$0.88	$71,044	87%
Fireworks AI	$0.90	$72,659	87%
Groq	$0.59	$47,632	91%
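
The API comparison can be checked with a few lines of Python. This is a rough sketch under one key assumption: a sustained demand of 2,560 tok/s, which is not stated on the page but is the rate at which the listed per-token prices yield the listed annual costs. The helper names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def annual_tokens_millions(demand_tok_s):
    """Tokens generated per year, in millions, at a constant rate."""
    return demand_tok_s * SECONDS_PER_YEAR / 1e6


def api_annual_cost(price_per_m_tok, demand_tok_s):
    """Annual serverless API bill for a constant token rate."""
    return price_per_m_tok * annual_tokens_millions(demand_tok_s)


def savings_vs_self_host(api_cost, self_host_cost):
    """Fraction of the self-hosted bill avoided by using the API."""
    return 1 - api_cost / self_host_cost


if __name__ == "__main__":
    demand = 2560            # assumed sustained tok/s, inferred from the table
    self_host = 538_214      # annual self-hosted cost shown on this page
    per_m = self_host / annual_tokens_millions(demand)
    print(f"self-host: ${per_m:.2f} per M tokens")
    for provider, price in [("Together.ai", 0.88), ("Fireworks AI", 0.90), ("Groq", 0.59)]:
        cost = api_annual_cost(price, demand)
        saved = savings_vs_self_host(cost, self_host)
        print(f"{provider}: ${cost:,.0f}/yr, {saved:.1%} cheaper than self-host")
```

Note that the self-hosted cost per million tokens is computed against tokens actually served, not against the fleet's 4.2K tok/s capacity, which is why idle headroom makes self-hosting look expensive at this traffic level.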

Provisioning 4.2K tokens/sec of capacity takes 12 GPUs, at an annual infrastructure cost of $538,214. A control plane that keeps those GPUs at 70%+ utilization is worth $188,375/yr in recovered capacity, about 35% of that spend.

Get your capacity plan. Enter your work email and we'll send your fleet recommendation.

Throughput is estimated from empirical baselines scaled by GPU compute, model efficiency, and an engine factor. Multi-GPU scaling uses conservative tensor-parallel speedup factors: 1×, 1.75×, 3.2×, and 5.5× for 1, 2, 4, and 8 GPUs.
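
As a small illustration of the multi-GPU scaling rule, the Python sketch below applies the stated speedup factors to a single-GPU baseline. The 400 tok/s baseline is an assumption, chosen only because it yields 700 tok/s for a 2-GPU replica and therefore 4.2K tok/s across 6 replicas. The function names are hypothetical, and the GPU, model, and engine factors are assumed to be folded into that baseline.

```python
# Tensor-parallel speedup factors quoted on this page, keyed by GPUs per replica.
TP_SPEEDUP = {1: 1.0, 2: 1.75, 4: 3.2, 8: 5.5}


def replica_throughput(single_gpu_tok_s, gpus_per_replica):
    """Estimated tok/s for one replica sharded across `gpus_per_replica` GPUs."""
    if gpus_per_replica not in TP_SPEEDUP:
        raise ValueError(f"no speedup factor for {gpus_per_replica} GPUs")
    return single_gpu_tok_s * TP_SPEEDUP[gpus_per_replica]


def per_gpu_efficiency(gpus_per_replica):
    """Fraction of ideal linear scaling retained at this replica size."""
    return TP_SPEEDUP[gpus_per_replica] / gpus_per_replica


if __name__ == "__main__":
    baseline = 400  # assumed single-GPU tok/s, not a figure from the tool
    for n in sorted(TP_SPEEDUP):
        tok_s = replica_throughput(baseline, n)
        print(f"{n} GPU(s): {tok_s:.0f} tok/s per replica, "
              f"{per_gpu_efficiency(n):.1%} of linear scaling")
```

The per-GPU efficiency falls from 87.5% at 2 GPUs to about 69% at 8, so when a model fits on fewer GPUs, more small replicas generally give more total throughput than fewer large ones.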
