How to Reduce LLM Inference Costs Without Sacrificing SLA

GPU costs for LLM inference are significant and often poorly optimized. The seven levers below, ranked by impact and implementation effort, reduce spend without degrading latency or throughput.
The Cost Structure of LLM Inference
GPU inference costs have two components: the cost of the GPU itself (reserved or on-demand) and the utilization efficiency of that GPU. Most cost reduction efforts focus on the first — negotiating better rates, switching providers, or reducing replica count. The second is where the real leverage lives.
A GPU running at 40% effective utilization costs the same as one running at 85%. The difference is pure waste. Closing that gap is the highest-ROI cost reduction available before you spend a dollar on rate negotiation.
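To make the arithmetic concrete, here is a minimal sketch. The $4.00 hourly rate is a made-up illustrative figure, and `cost_per_useful_hour` is a hypothetical helper rather than part of any library.

```python
def cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Dollars paid per hour of work the GPU actually performs."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization

RATE = 4.00  # hypothetical on-demand price, USD per GPU-hour

low = cost_per_useful_hour(RATE, 0.40)
high = cost_per_useful_hour(RATE, 0.85)
print(f"40% utilized: ${low:.2f} per useful GPU-hour")   # $10.00
print(f"85% utilized: ${high:.2f} per useful GPU-hour")  # $4.71
print(f"Effective cost reduction: {1 - high / low:.0%}")  # 53%
```

The reduction does not depend on the rate you pick: it is simply 1 - 0.40/0.85, so the same bill buys roughly twice the useful work.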
---
Lever 1: Right-Size GPU Tiers (Highest Impact)
Running a 7B model on an H100 when an A10G would suffice is the most common and most expensive waste pattern in inference infrastructure. H100 on-demand rates run 3–5x the cost of an A10G. If the workload doesn't justify it, the difference goes straight to waste.
How to assess: Check SM utilization and VRAM occupancy. If SM utilization is consistently below 50% and VRAM occupancy is below 40%, the workload is over-tiered.
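The check above can be sketched as a small function. This assumes you have already exported utilization samples as fractions (for example from `nvidia-smi` or DCGM), and it interprets "consistently" as 95% of samples, which is an assumption of this sketch and not a rule from the article. The function names are hypothetical.

```python
from typing import Sequence

SM_UTIL_THRESHOLD = 0.50  # from the rule of thumb above
VRAM_THRESHOLD = 0.40     # from the rule of thumb above
CONSISTENCY = 0.95        # assumption: "consistently" means 95% of samples

def mostly_below(samples: Sequence[float], threshold: float) -> bool:
    """True if at least CONSISTENCY of the samples fall below threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    below = sum(1 for s in samples if s < threshold)
    return below / len(samples) >= CONSISTENCY

def is_over_tiered(sm_util: Sequence[float], vram_occupancy: Sequence[float]) -> bool:
    """Both conditions must hold: low SM utilization and low VRAM occupancy."""
    return (mostly_below(sm_util, SM_UTIL_THRESHOLD)
            and mostly_below(vram_occupancy, VRAM_THRESHOLD))

# Hypothetical samples, expressed as fractions between 0.0 and 1.0.
sm = [0.22, 0.31, 0.18, 0.35, 0.27]
vram = [0.25, 0.26, 0.25, 0.27, 0.26]
print(is_over_tiered(sm, vram))  # True
```

Requiring both conditions matters: a workload with low SM utilization but high VRAM occupancy may still need the larger card for memory alone.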
Expected savings: 50–70% cost reduction per GPU by moving from H100 to A10G for appropriate workloads.
See also: [GPU Right-Sizing: Matching Tier to Workload](/blog/gpu-ops-right-sizing-gpu-tiers)
---
Lever 2: Continuous Batching
Static batching allocates GPU resources for a fixed batch size, whether all slots are filled or not. During low-traffic periods, you pay for empty slots.
Continuous batching (also called in-flight batching or iteration-level scheduling) admits new requests into the running batch as soon as a slot opens, without waiting for the full batch to complete. This increases effective throughput without adding GPU capacity.
vLLM, TGI, and SGLang all support continuous batching natively.
Expected improvement: 2–4x throughput increase at the same GPU cost, which translates directly to lower cost-per-token.
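A toy simulation illustrates why refilling slots helps. It is a sketch under strong simplifications: every request is already queued, each request needs a known number of decode steps, and prefill and memory limits are ignored. The function names and request lengths are invented for illustration and do not reflect any engine's scheduler.

```python
from collections import deque
from typing import List

def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining decode steps per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Requests on their last step finish and free their slot.
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # decode steps per request
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)
print(static, continuous)  # 200 110
```

In this example the short requests no longer wait behind the long ones, so the same eight requests finish in 110 decode steps instead of 200. Real gains depend on how much output lengths vary.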
---
Lever 3: Quantization
Quantization reduces model precision from FP16 to a lower-precision format such as FP8, INT8, or INT4, shrinking VRAM requirements and increasing throughput on the same hardware.
| Method | VRAM Reduction | Quality Impact | Best For |
|---|---|---|---|
| FP8 | ~50% | Negligible on H100 | Production on H100 |
| INT8 (bitsandbytes) | ~50% | Minimal | General use |
| AWQ (INT4) | ~75% | Small | Cost-sensitive workloads |
| GPTQ (INT4) | ~75% | Small | Offline quantization |
Quantizing a 70B model from FP16 to INT4 brings weight memory from ~140GB to ~35GB, potentially fitting on a single A100 80GB instead of a multi-GPU setup and cutting infrastructure cost significantly. KV cache and activations need additional headroom on top of the weights.
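The figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. A minimal sketch, where `weight_memory_gb` is a hypothetical helper and 1 GB is taken as 1e9 bytes:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

for label, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"70B @ {label}: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ FP8/INT8: ~70 GB
# 70B @ INT4: ~35 GB
```

This counts weights only. Quantization scales and zero points, the KV cache, and activations all add to the real footprint, so treat the result as a lower bound when choosing a GPU.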
---
Lever 4: Autoscaling to Zero
For workloads with variable traffic — overnight lows, weekend troughs, batch windows — keeping GPUs running at idle is expensive. Autoscaling down to zero replicas during low-traffic periods and scaling back up on demand eliminates idle spend.
The trade-off is cold start latency. For latency-sensitive workloads, scale to a minimum of one replica rather than zero, and use predictive scaling to pre-warm before anticipated traffic spikes.
Expected savings: 20–60% cost reduction for workloads with significant traffic variability.
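To estimate what autoscaling might save for your own traffic, compare replica-hours under a fixed peak-sized fleet with replica-hours when capacity follows demand. The sketch below uses an invented 24-hour traffic profile and an assumed per-replica capacity, and it ignores cold starts and scale-up lag.

```python
import math
from typing import List

def replica_hours(hourly_rps: List[float], rps_per_replica: float, min_replicas: int) -> int:
    """Total replica-hours when capacity is resized to demand every hour."""
    return sum(max(min_replicas, math.ceil(rps / rps_per_replica)) for rps in hourly_rps)

# Hypothetical traffic profile: requests/second for each hour of the day.
traffic = [0] * 6 + [5] * 3 + [20] * 8 + [10] * 4 + [2] * 3
RPS_PER_REPLICA = 5.0  # assumed capacity of one replica

peak = max(math.ceil(r / RPS_PER_REPLICA) for r in traffic)
fixed = peak * len(traffic)  # always provisioned for peak
to_zero = replica_hours(traffic, RPS_PER_REPLICA, min_replicas=0)
min_one = replica_hours(traffic, RPS_PER_REPLICA, min_replicas=1)

print(f"fixed: {fixed}, scale-to-zero: {to_zero}, min-one: {min_one}")
print(f"savings to zero: {1 - to_zero / fixed:.0%}")  # 52%
print(f"savings min one: {1 - min_one / fixed:.0%}")  # 46%
```

For this profile, keeping one warm replica costs about six percentage points of savings compared with scaling to zero, which is the price of avoiding cold starts overnight.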
---
Lever 5: Prefix Caching
For workloads with shared system prompts or repeated context (RAG pipelines, multi-turn conversations, agent workflows), prefix caching reuses computed KV cache blocks across requests.
The GPU doesn't recompute attention over the repeated context; it reuses the cached KV blocks. This reduces prefill compute per request roughly in proportion to the fraction of the prompt that is shared.
`vllm serve <model> --enable-prefix-caching`
Expected improvement: 20–40% reduction in time-to-first-token for workloads with >50% prompt reuse.
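The mechanism can be illustrated with a toy cache that tracks which prompt prefixes have been seen, one block at a time. This is a conceptual sketch, not how any engine implements it: `ToyPrefixCache` is hypothetical, it stores no KV tensors, and it keys on the whole prefix tuple, whereas production engines typically hash each block together with its parent's hash.

```python
from typing import List, Set, Tuple

BLOCK_SIZE = 4  # tokens per cache block; kept small for illustration

class ToyPrefixCache:
    """Remembers which block-aligned prompt prefixes have been processed."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def process(self, tokens: List[int]) -> int:
        """Return how many leading tokens hit the cache, then cache this prompt."""
        cached = 0
        still_matching = True
        full_blocks = len(tokens) // BLOCK_SIZE
        for b in range(1, full_blocks + 1):
            prefix = tuple(tokens[: b * BLOCK_SIZE])  # key covers the whole prefix
            if still_matching and prefix in self._seen:
                cached = b * BLOCK_SIZE
            else:
                still_matching = False  # a miss ends the reusable prefix
                self._seen.add(prefix)
        return cached

system_prompt = list(range(100, 112))  # 12 shared tokens (hypothetical IDs)
cache = ToyPrefixCache()
first = cache.process(system_prompt + [1, 2, 3, 4])
second = cache.process(system_prompt + [5, 6, 7, 8])
print(first, second)  # 0 12
```

The second request reuses 12 of its 16 tokens, so only the final block needs prefill. Reuse stops at the first block that differs, which is why shared content belongs at the start of the prompt.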
---
Lever 6: Model Routing
Not every request needs your largest, most capable model. A routing layer that classifies requests by complexity and directs simple ones to a smaller, cheaper model can dramatically reduce average cost-per-request.
A practical split:
- Simple factual queries → 7B model (A10G)
- Reasoning tasks → 34B model (A100)
- Complex multi-step reasoning → 70B model (H100)
Expected savings: 40–60% reduction in average cost-per-request for mixed-complexity workloads.
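A routing layer for the split above might look like the following sketch. The keyword heuristic, the relative costs, and the traffic mix are all invented for illustration; a production router would more likely use a trained classifier and measured per-tier costs.

```python
from typing import Dict

# Hypothetical relative cost per request for each tier (70B = 1.0 baseline).
COST = {"7B": 0.10, "34B": 0.40, "70B": 1.00}

def route(prompt: str) -> str:
    """Toy heuristic classifier that picks a model tier from keywords."""
    text = prompt.lower()
    if any(k in text for k in ("step by step", "plan", "prove")):
        return "70B"
    if any(k in text for k in ("why", "compare", "explain")):
        return "34B"
    return "7B"

def blended_cost(mix: Dict[str, float]) -> float:
    """Average cost per request given the share of traffic sent to each tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(COST[tier] * share for tier, share in mix.items())

mix = {"7B": 0.4, "34B": 0.3, "70B": 0.3}  # assumed traffic split
savings = 1 - blended_cost(mix) / COST["70B"]

print(route("What is the capital of France?"))  # 7B
print(route("Explain why TCP uses a three-way handshake"))  # 34B
print(f"Savings vs. sending everything to 70B: {savings:.0%}")  # 54%
```

The savings figure is only as good as the classifier: requests misrouted to a model that is too small show up as quality regressions, not as cost.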
---
Lever 7: Speculative Decoding
Speculative decoding uses a small draft model to generate candidate tokens, which the large model then verifies in parallel. When the draft model is accurate, this increases effective throughput without changing output quality.
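For autoregressive generation tasks, this can increase throughput by 2–3x on the same hardware. The gain depends on how often the large model accepts the draft tokens, and it tends to be largest at small batch sizes.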
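Implementation: vLLM supports speculative decoding natively. Older releases expose it via `--speculative-model`; newer releases configure it through `--speculative-config`, so check the documentation for your version.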
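The draft-then-verify loop can be sketched with toy stand-ins for both models. This is the greedy variant only, with made-up deterministic "models" over single-digit tokens. Sampling-based speculative decoding uses a probabilistic acceptance rule instead of the equality check shown here. All names are hypothetical.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_model(ctx: List[int]) -> int:
    """Stand-in for the large model: a deterministic toy rule."""
    return (ctx[-1] * 3 + 1) % 10

def draft_model(ctx: List[int]) -> int:
    """Stand-in for the small model: agrees with the target except after a 3."""
    return 9 if ctx[-1] == 3 else target_model(ctx)

def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]

def speculative_decode(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (generated tokens, number of target verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_model(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each position; a real engine scores them in one batched pass.
        passes += 1
        for tok in proposal:
            expected = target_model(out)
            out.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            out.append(target_model(out))  # bonus token when all k are accepted
    return out[len(prompt):][:n], passes

tokens, passes = speculative_decode([0], n=8)
print(tokens == greedy_decode(target_model, [0], 8), passes)  # True 2
```

The output matches plain greedy decoding with the target model, which is the sense in which quality is unchanged. Here eight tokens take two verification passes instead of eight sequential target steps, because the draft is right three times out of four.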
---
Prioritizing the Levers
| Lever | Impact | Effort | Start Here If... |
|---|---|---|---|
| Right-sizing | Very High | Low | SM util < 50% |
| Continuous batching | High | Low | Using static batching |
| Quantization | High | Medium | VRAM is the constraint |
| Autoscaling | Medium | Medium | Traffic is variable |
| Prefix caching | Medium | Low | Shared prompts exist |
| Model routing | High | High | Mixed-complexity traffic |
| Speculative decoding | Medium | Medium | Throughput is the goal |
Start with right-sizing and continuous batching: both are low-effort and high-impact. Then layer in quantization and prefix caching. Model routing requires the most architectural investment but offers a high ceiling on cost reduction for mixed-complexity traffic; speculative decoding is a moderate-effort addition when throughput is the goal.
---
Next in the GPU Ops Field Guide: [GPU Fleet Observability: What to Monitor and Why →](/blog/gpu-ops-fleet-observability)