Free Tool
Inference Capacity Planner.
How many GPUs do you actually need? Enter your model, peak traffic, and serving engine to get a replica count, an annual cost, and an API vs. self-host comparison.
Throughput is estimated from empirical baselines, scaled by GPU compute, model efficiency, and an engine factor. Multi-GPU scaling uses conservative tensor-parallel efficiency factors (1×, 1.75×, 3.2×, and 5.5× for 1, 2, 4, and 8 GPUs).
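To make the scaling concrete, here is a minimal Python sketch of how a replica count and annual cost could be derived from those tensor-parallel factors. It is an illustration under stated assumptions, not the planner's actual code: the function names, the `engine_factor` and `headroom` parameters, and every numeric input except the four scaling factors are hypothetical.

```python
import math

# Tensor-parallel scaling factors quoted above: GPUs per replica -> speedup
# relative to a single GPU.
TP_SCALING = {1: 1.0, 2: 1.75, 4: 3.2, 8: 5.5}


def replica_throughput(single_gpu_tps, gpus_per_replica, engine_factor=1.0):
    """Estimated tokens/sec for one replica (hypothetical helper).

    single_gpu_tps: assumed empirical baseline for the model on one GPU.
    engine_factor:  assumed multiplier for the serving engine.
    """
    return single_gpu_tps * TP_SCALING[gpus_per_replica] * engine_factor


def replicas_needed(peak_tps, single_gpu_tps, gpus_per_replica,
                    engine_factor=1.0, headroom=0.0):
    """Replicas required to serve peak traffic, rounded up.

    headroom is an assumed safety margin (0.2 means 20% above peak).
    """
    per_replica = replica_throughput(single_gpu_tps, gpus_per_replica,
                                     engine_factor)
    return math.ceil(peak_tps * (1 + headroom) / per_replica)


def annual_gpu_cost(replicas, gpus_per_replica, gpu_hourly_usd):
    """Yearly cost assuming every GPU runs 24 hours a day, 365 days a year."""
    return replicas * gpus_per_replica * gpu_hourly_usd * 24 * 365


# Example with made-up inputs: 10,000 tokens/sec at peak, 1,000 tokens/sec
# per single GPU, 4 GPUs per replica, $2.00 per GPU-hour.
n = replicas_needed(peak_tps=10_000, single_gpu_tps=1_000, gpus_per_replica=4)
cost = annual_gpu_cost(n, gpus_per_replica=4, gpu_hourly_usd=2.0)
```

Because the factors grow more slowly than the GPU count (3.2× for 4 GPUs is 80% per-GPU efficiency, 5.5× for 8 GPUs is about 69%), wider tensor parallelism costs more per token served; it is usually worth it only when the model does not fit on fewer GPUs or latency requires it.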
Already running inference? See how close you are to the plan.
Most teams provision 2–3× the capacity they need. piqc shows you the gap between your planned capacity and what your cluster actually uses.