Free Tool
vLLM Configuration Calculator & Optimizer
Get a recommended max_num_seqs, KV cache allocation, and speculative decoding decision for your deployment — and see whether your configuration will meet your p95 latency target under real traffic.
Enter your model, hardware, and traffic profile to get a recommended vLLM configuration and predicted latency.
⚠ Peak traffic (30 req/s) exceeds estimated capacity (7.0 req/s). Add replicas or GPUs.
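A rough sketch of how the gap in the warning above could translate into a replica count. It assumes every replica sustains the same estimated capacity and that traffic is spread evenly across replicas. The function name `replicas_needed` and the `target_utilization` headroom parameter are illustrative and are not part of the calculator.

```python
import math


def replicas_needed(peak_rps, capacity_rps_per_replica, target_utilization=1.0):
    """Smallest replica count whose combined capacity covers peak traffic.

    target_utilization (hypothetical knob) reserves headroom: 0.7 means each
    replica is planned to run at no more than 70% of its estimated capacity.
    """
    if peak_rps < 0:
        raise ValueError("peak_rps must be non-negative")
    if capacity_rps_per_replica <= 0:
        raise ValueError("capacity_rps_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = capacity_rps_per_replica * target_utilization
    return max(1, math.ceil(peak_rps / usable_rps))


# Figures from the warning: 30 req/s peak against 7.0 req/s estimated capacity.
print(replicas_needed(30, 7.0))                          # 5
print(replicas_needed(30, 7.0, target_utilization=0.7))  # 7
```

Planning with some headroom is usually the safer choice, because queuing delay grows sharply as a replica approaches full utilization.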
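KV cache per request = 2 × layers × kv_heads × head_dim × bytes_per_element × context_length, where the factor of 2 accounts for storing both keys and values.
p95 latency = p95 TTFT + output_tokens × ms/token + queuing delay, with queuing delay taken from an M/M/1 model.
Throughput baselines come from vLLM benchmarks on well-configured deployments.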
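The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation. It assumes "queuing delay" means the mean time spent waiting in an M/M/1 queue, Wq = ρ / (μ − λ), since the page does not say which M/M/1 quantity is used. All function and parameter names are hypothetical, and the example inputs are illustrative values rather than measurements.

```python
import math


def kv_cache_bytes_per_request(layers, kv_heads, head_dim, bytes_per_element, context_length):
    """KV cache for one request; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element * context_length


def mm1_queue_wait_s(arrival_rps, service_rps):
    """Mean wait in queue for M/M/1 (an assumed reading of 'queuing delay').

    Returns infinity when arrivals meet or exceed capacity, because the
    queue then grows without bound.
    """
    if service_rps <= 0:
        raise ValueError("service_rps must be positive")
    if arrival_rps >= service_rps:
        return math.inf
    utilization = arrival_rps / service_rps
    return utilization / (service_rps - arrival_rps)


def p95_latency_ms(p95_ttft_ms, output_tokens, ms_per_token, arrival_rps, service_rps):
    """p95 TTFT + decode time + queuing delay, all in milliseconds."""
    queue_ms = mm1_queue_wait_s(arrival_rps, service_rps) * 1000.0
    return p95_ttft_ms + output_tokens * ms_per_token + queue_ms


# Illustrative inputs: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes), 8192-token context.
kv_bytes = kv_cache_bytes_per_request(32, 8, 128, 2, 8192)
print(kv_bytes / 2**30, "GiB per request")  # 1.0 GiB per request

# Half-loaded server: 3.5 req/s arriving against 7.0 req/s capacity.
print(round(p95_latency_ms(200, 100, 20, 3.5, 7.0), 1), "ms")  # 2342.9 ms

# Overloaded server, as in the 30 req/s vs 7.0 req/s warning: latency is unbounded.
print(p95_latency_ms(200, 100, 20, 30, 7.0))  # inf
```

The overloaded case returns infinity rather than a large number, which mirrors why the calculator raises a capacity warning instead of reporting a latency figure.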
Already deployed? See what's actually happening.
This calculator produces estimates from your inputs. piqc scans your running cluster and reports the actual tok/sec, memory utilization, and configuration gaps, with no agents and no write access required.