Free Tool

vLLM Configuration Calculator & Optimizer.

Get a recommended max_num_seqs, KV cache allocation, and speculative decoding decision for your deployment — and see whether your configuration will meet your p95 latency target under real traffic.

Enter your model, hardware, and traffic profile to get a recommended vLLM configuration and predicted latency.

Model

GPU Type

GPU Count

Quantization

Workload Type

Avg Prompt Length (tokens)

Avg Output Length (tokens)

Avg Requests / sec

Peak Requests / sec

p95 Latency Target (ms)

⚠ Peak traffic (30 req/s) exceeds estimated capacity (7.0 req/s). Add replicas or GPUs.

Recommended max_num_seqs: 45

H100 80GB × 1 · Llama 3 8B / Mistral 7B

Predicted p95 latency: 11,420 ms (Overloaded, vs 1,000 ms target)

Avg TTFT: 10 ms

ms / output token: 25 ms

Max throughput: 1,800 tok/s

KV cache / request: 0.10 GB

Max seqs (memory limit): 615

Recommended vLLM Config

max_num_seqs = 45

gpu_memory_utilization = 0.90

max_model_len = 768

tensor_parallel_size = 1

quantization = None
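As an illustration of how these settings could be passed to a server, the sketch below turns the recommended values into a `vllm serve` command line. The model ID is an assumption: the page only names "Llama 3 8B / Mistral 7B", so a Hugging Face ID is used as a stand-in. The helper `build_serve_command` is hypothetical and is not part of vLLM.

```python
import shlex

# Values copied from the "Recommended vLLM Config" panel.
RECOMMENDED = {
    "max_num_seqs": 45,
    "gpu_memory_utilization": 0.90,
    "max_model_len": 768,
    "tensor_parallel_size": 1,
    "quantization": None,  # None means: leave the flag off
}

# Assumption: stand-in model ID chosen for illustration only.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


def build_serve_command(model: str, config: dict) -> list[str]:
    """Turn a config dict into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", model]
    for key, value in config.items():
        if value is None:
            continue  # unset options are omitted entirely
        # vLLM CLI flags use dashes where the config names use underscores.
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


if __name__ == "__main__":
    print(shlex.join(build_serve_command(MODEL_ID, RECOMMENDED)))
```

Keeping the values in a dict makes it easy to re-run the calculator with different inputs and compare the resulting commands side by side.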

VRAM Allocation — 80 GB total

Model weights — 16 GB

KV cache available — 62.0 GB

Overhead — 2 GB
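The split above can be reproduced with simple arithmetic. The sketch below assumes roughly 8 billion parameters stored at 2 bytes each (FP16/BF16) and decimal gigabytes; neither assumption is stated on the page. The three figures shown sum to the full 80 GB, so they do not appear to apply the 0.90 gpu_memory_utilization factor. The optional `utilization` argument shows what the budget would be if that factor were applied.

```python
GB = 1e9  # assumption: decimal gigabytes


def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory taken by model weights, in GB."""
    return n_params * bytes_per_param / GB


def kv_budget_gb(total_gb: float, weights: float, overhead_gb: float,
                 utilization: float = 1.0) -> float:
    """VRAM left for the KV cache after weights and fixed overhead."""
    return total_gb * utilization - weights - overhead_gb


if __name__ == "__main__":
    w = weights_gb(8e9, 2)                 # assumed 8B params at 2 bytes each
    print(w)                               # 16.0
    print(kv_budget_gb(80, w, 2))          # 62.0, matches the panel
    print(kv_budget_gb(80, w, 2, 0.90))    # about 54 if the 0.90 cap applied
```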

Speculative Decoding

Not recommended

High concurrency: the draft model's overhead outweighs the gains. Use batching instead.

What moves the needle most

🟡 Add 1 GPU (2× H100 80GB): −3,210 ms → 8,210 ms

🟡 Cap output to 128 tokens: −3,200 ms → 8,220 ms

🟡 Add a replica (halve load per instance): −3,128 ms → 8,292 ms

Adjust inputs on the left to apply these changes.


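KV cache per request = 2 × layers × kv_heads × head_dim × bytes × context_length, where the factor 2 covers the key and value tensors and bytes is the size of one cached element (2 for FP16/BF16).

p95 latency = p95 TTFT + output_tokens × ms/token + queuing delay, with the queuing delay taken from an M/M/1 model.

Throughput baselines come from vLLM benchmarks on well-configured deployments.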
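A minimal sketch of these formulas follows. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the published Llama 3 8B configuration, not from this page. With a 768-token context and FP16 it reproduces the 0.10 GB per request and the 615-sequence memory limit shown above. The queuing term uses the mean M/M/1 wait and returns infinity when arrivals meet or exceed capacity. The page does not say how it derives a p95 figure or how it handles overload, so this part will not reproduce the displayed latency.

```python
import math


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int, context_length: int) -> int:
    """KV cache for one request; the 2 counts keys plus values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_length


def max_seqs_by_memory(kv_budget_bytes: float, per_request_bytes: int) -> int:
    """How many full-length requests fit in the KV cache budget."""
    return int(kv_budget_bytes // per_request_bytes)


def mm1_mean_queue_wait_ms(arrival_rps: float, service_rps: float) -> float:
    """Mean time waiting in an M/M/1 queue: Wq = rho / (mu - lambda)."""
    if arrival_rps >= service_rps:
        return math.inf  # unstable queue: the wait grows without bound
    rho = arrival_rps / service_rps
    return 1000.0 * rho / (service_rps - arrival_rps)


def latency_ms(ttft_ms: float, output_tokens: int,
               ms_per_token: float, queue_ms: float) -> float:
    """TTFT + decode time + queuing delay, as in the formula above."""
    return ttft_ms + output_tokens * ms_per_token + queue_ms


if __name__ == "__main__":
    per_req = kv_cache_bytes(32, 8, 128, 2, 768)  # assumed model shape
    print(per_req)                                # 100663296 bytes, about 0.10 GB
    print(max_seqs_by_memory(62e9, per_req))      # 615
    print(mm1_mean_queue_wait_ms(30, 7.0))        # inf: 30 req/s exceeds 7.0 req/s
```

The infinite wait at 30 req/s against a 7.0 req/s capacity corresponds to the "Overloaded" status: no per-instance tuning brings an unstable queue under the target, which is why the top suggestions add GPUs or replicas.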

Already deployed? See what's actually happening.

This calculator estimates based on your inputs. piqc scans your running cluster and tells you the actual tok/sec, memory utilization, and configuration gaps — no agents, no write access required.
