GPU Right-Sizing: Matching Tier to Workload

Running a 7B model on an H100 is as wasteful as running a 70B model on an A10G. Right-sizing GPU tiers is one of the highest-leverage cost optimizations in inference — and most teams get it wrong.
The Two Directions of Mismatch
GPU tier mismatches run in both directions — and both are expensive.
Over-tiered: A small model on a high-end GPU. The model fits easily, runs fast, but consumes a fraction of the available VRAM and compute. You're paying for an H100 and getting A10G-level workload density.
Under-tiered: A large model crammed onto a GPU with insufficient VRAM. The model barely fits, KV cache is constrained, batch sizes are tiny, and the system runs at the edge of OOM. Latency suffers and stability is fragile.
Most teams discover mismatches reactively — after a cost audit or an OOM incident. The goal is to catch them proactively.
---
GPU Tier Reference for LLM Inference
| GPU | VRAM | Best Fit |
|---|---|---|
| A10G | 24 GB | 7B–13B models, moderate concurrency |
| L40S | 48 GB | 13B–34B models, higher concurrency |
| A100 40GB | 40 GB | 13B–34B models, training and inference |
| A100 80GB | 80 GB | 34B–70B models, high concurrency |
| H100 80GB | 80 GB | 70B models, maximum throughput |
| H100 NVL | 94 GB | 70B+ models, long context |
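These are starting points. The upper end of each range generally assumes quantized weights: a 13B model in FP16 needs about 26 GB for weights alone, which already exceeds the A10G's 24 GB. Actual fit depends on quantization, batch size, context length, and concurrency targets.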
---
How to Right-Size a Workload
Step 1 — Measure actual VRAM consumption
Don't estimate — measure. Deploy the model with realistic traffic and record peak VRAM usage:
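
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

The command prints one value per GPU in MiB and reflects only the moment it runs, so take the maximum over a realistic load test. Add 15–20% headroom for KV cache growth under peak load.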
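Recording the peak can be scripted: the sketch below polls the same query at a fixed interval and keeps the per-GPU maximum. It assumes `nvidia-smi` is on the PATH; the function names and the one-second interval are illustrative choices, not part of any tool.

```python
import subprocess
import time

# Same query as above: one line per GPU, memory used in MiB, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output: str) -> list[int]:
    """Parse the CSV output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def update_peaks(peaks: list[int], sample: list[int]) -> list[int]:
    """Return the element-wise maximum of the running peaks and a new sample."""
    if not peaks:
        return list(sample)
    return [max(p, s) for p, s in zip(peaks, sample)]


def record_peak_vram(duration_s: float, interval_s: float = 1.0) -> list[int]:
    """Poll nvidia-smi for duration_s seconds and return peak MiB per GPU."""
    peaks: list[int] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peaks = update_peaks(peaks, parse_memory_used(out))
        time.sleep(interval_s)
    return peaks
```

Running `record_peak_vram(600)` alongside a ten-minute load test would return one peak per GPU. Some inference servers reserve a fixed share of VRAM at startup, in which case this number reflects the reservation and the server's own cache metrics are the better signal.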
Step 2 — Calculate effective VRAM requirement

    Required VRAM = Model weights + KV cache (peak) + Activation memory + 15% headroom

For a 70B model in FP16: ~140 GB for weights alone. That requires tensor parallelism across at least 2x H100 80GB (160 GB total, which leaves only about 20 GB for KV cache and activations) or quantization to fit on a single GPU.
For a 7B model in INT8: ~7GB weights. An A10G has substantial headroom for concurrent requests.
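The formula is easy to work through in code. The sketch below approximates weights as parameter count times bytes per parameter, and it assumes the 15% headroom is applied to the sum of the other three terms (the formula does not say what the 15% is taken of). The KV cache and activation figures are inputs you would measure; the values in the usage note are placeholders.

```python
# Approximate bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def required_vram_gb(
    params_billion: float,
    precision: str,
    kv_cache_gb: float,
    activation_gb: float,
    headroom: float = 0.15,
) -> float:
    """Weights + peak KV cache + activations, scaled up by the headroom fraction."""
    base = weights_gb(params_billion, precision) + kv_cache_gb + activation_gb
    return base * (1.0 + headroom)
```

`weights_gb(70, "fp16")` gives 140.0 and `weights_gb(7, "int8")` gives 7.0, matching the figures above. With a hypothetical 6 GB of peak KV cache and 1 GB of activations, `required_vram_gb(7, "int8", 6.0, 1.0)` comes to about 16.1 GB, which fits within an A10G's 24 GB.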
Step 3 — Check SM utilization on the current tier
If SM utilization is consistently below 40% on an A100 or H100, the workload doesn't justify the tier. Move down.
If SM utilization is above 90% and latency is suffering, the workload has outgrown the tier. Move up or scale horizontally.
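These two rules can be sketched as a small decision function. The 40% and 90% thresholds come from the text; treating "consistently" as "at least 90% of samples" is an assumption made here for illustration, as are the function and parameter names.

```python
def tier_recommendation(
    sm_util_samples: list[float],
    latency_ok: bool,
    low: float = 40.0,
    high: float = 90.0,
    consistent_share: float = 0.9,  # assumed definition of "consistently"
) -> str:
    """Classify a workload from SM utilization samples (percent, 0-100)."""
    if not sm_util_samples:
        raise ValueError("need at least one utilization sample")
    n = len(sm_util_samples)
    share_low = sum(u < low for u in sm_util_samples) / n
    share_high = sum(u > high for u in sm_util_samples) / n
    if share_low >= consistent_share:
        return "move down"
    # High utilization alone is fine; it only signals a problem if latency suffers.
    if share_high >= consistent_share and not latency_ok:
        return "move up or scale out"
    return "keep"
```

Samples could come from `nvidia-smi --query-gpu=utilization.gpu`, with the caveat that this field reports the share of time any kernel was running, which is a coarse proxy for SM activity.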
Step 4 — Factor in concurrency
A single 7B model on an A10G might run at 30% SM utilization. But with 8 concurrent requests, that same GPU might hit 85% — making the tier correct at scale even if it looks oversized at low traffic.
Right-sizing is a function of concurrent load, not just model size.
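Concurrency shows up in memory as well as compute, because every in-flight sequence holds its own KV cache. A rough sketch of that ceiling, using the standard per-token KV cache size for a decoder-only transformer (keys and values, per layer, per KV head). It assumes the worst case where every sequence fills the whole context window and no cache is shared; the model shape in the usage note is a 7B-class shape picked for illustration.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """KV cache bytes for one token: 2 (K and V) x layers x kv_heads x head_dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(
    free_vram_gib: float,
    context_tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """How many full-context sequences fit in the VRAM left after weights."""
    per_sequence = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem) * context_tokens
    return int(free_vram_gib * 1024**3 // per_sequence)
```

For 32 layers, 32 KV heads, and a head dimension of 128 in FP16, each token costs 524,288 bytes (0.5 MiB), so a 4,096-token sequence holds 2 GiB of cache. With 10 GiB free after weights, `max_concurrent_sequences(10, 4096, 32, 32, 128)` returns 5. Halving the context length doubles that.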
---
Common Mismatches and Their Cost
| Scenario | Symptom | Annual Waste or Impact (est.) |
|---|---|---|
| 7B model on H100 (low concurrency) | SM util < 20% | $40K–$80K per GPU |
| 70B model on A100 40GB | Constant OOM, tiny batches | Latency + reliability cost |
| 13B model on A10G at high concurrency | KV cache pressure, slow | Throughput ceiling hit |
---
Quantization as a Right-Sizing Tool
Quantization shrinks model weights, usually with little accuracy loss, so a larger model can fit on a smaller (cheaper) GPU tier. The reductions below are for weight memory relative to FP16:
- INT8 (bitsandbytes, LLM.int8()): ~50% VRAM reduction, minimal quality loss
- AWQ / GPTQ (INT4): ~75% VRAM reduction, small quality trade-off
- FP8 (H100-native): ~50% VRAM reduction, near-zero quality loss on supported hardware
Quantizing a 70B model to INT4 brings it from ~140GB to ~35GB — fitting comfortably on a single A100 80GB instead of requiring a multi-GPU setup.
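The arithmetic behind that example can be checked against the tier table. The sketch below computes weight size from bits per parameter and lists which GPUs would still leave room for KV cache and activations. The 30% reserve is an assumption, not a rule from this guide, and real quantized checkpoints run slightly larger than the ideal figure because they also store scales and zero-points.

```python
# VRAM per GPU in GB, taken from the tier reference table.
GPU_VRAM_GB = {
    "A10G": 24,
    "L40S": 48,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
    "H100 NVL": 94,
}

BITS_PER_PARAM = {"fp16": 16, "int8": 8, "fp8": 8, "int4": 4}


def quantized_weights_gb(params_billion: float, precision: str) -> float:
    """Ideal weight size in GB: billions of parameters x bits per parameter / 8."""
    return params_billion * BITS_PER_PARAM[precision] / 8


def gpus_that_fit(
    params_billion: float, precision: str, reserve_fraction: float = 0.3
) -> list[str]:
    """Single GPUs whose VRAM holds the weights after reserving a share for everything else."""
    need = quantized_weights_gb(params_billion, precision)
    return [
        name
        for name, vram in GPU_VRAM_GB.items()
        if need <= vram * (1 - reserve_fraction)
    ]
```

`quantized_weights_gb(70, "int4")` gives 35.0, and `gpus_that_fit(70, "int4")` returns the three 80 GB-class cards. With a 30% reserve the L40S drops out even though 35 GB is nominally under its 48 GB, which is the kind of near-miss that shows up later as KV cache pressure.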
---
Right-Sizing at Scale
Manual right-sizing works for a handful of models. At fleet scale — dozens of models, multiple clusters, mixed providers — it becomes untenable. Models get deployed and forgotten. Traffic patterns shift. New model versions change memory profiles.
Continuous right-sizing requires automated monitoring of VRAM headroom, SM utilization, and concurrency patterns — with alerts when a workload drifts outside its optimal tier range.
---
Next in the GPU Ops Field Guide: [KV Cache Pressure: Symptoms, Causes, and Fixes →](/blog/gpu-ops-kv-cache-pressure)