ParallelIQ
Free Tool

vLLM Configuration Calculator & Optimizer

Get a recommended max_num_seqs, KV cache allocation, and speculative decoding decision for your deployment — and see whether your configuration will meet your p95 latency target under real traffic.

Enter your model, hardware, and traffic profile to get a recommended vLLM configuration and predicted latency.

Peak traffic (30 req/s) exceeds estimated capacity (7.0 req/s). Add replicas or GPUs.
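As a rough illustration of what a warning like this implies, the sketch below estimates how many replicas would cover peak traffic. It assumes throughput scales linearly with replica count and that load is balanced evenly. Both are simplifications, and the function name `replicas_needed` is hypothetical, not part of the calculator or vLLM.

```python
import math


def replicas_needed(peak_rps: float, capacity_rps_per_replica: float) -> int:
    """Smallest replica count whose combined capacity covers peak traffic.

    Assumes linear scaling across replicas (an idealization) and adds no
    safety headroom on top of the estimated per-replica capacity.
    """
    if capacity_rps_per_replica <= 0:
        raise ValueError("capacity per replica must be positive")
    return max(1, math.ceil(peak_rps / capacity_rps_per_replica))


# Values taken from the warning above: 30 req/s peak, 7.0 req/s per replica.
print(replicas_needed(30, 7.0))
```

In practice you would likely want headroom beyond this minimum, since a replica running close to its capacity accumulates queuing delay quickly.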

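KV cache per request (in bytes) = 2 × layers × kv_heads × head_dim × bytes per element × context_length, where the factor of 2 covers the key and value tensors and bytes per element is 2 for FP16/BF16.

p95 latency = p95 TTFT + output_tokens × ms/token + queuing delay (M/M/1).

Throughput baselines come from vLLM benchmarks on well-configured deployments.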
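The two formulas can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the calculator's actual implementation: the function names are hypothetical, `bytes_per_elem` stands for the "bytes" term in the formula, and the queuing term uses the mean M/M/1 wait in queue, ρ / (μ − λ), because the page does not say which M/M/1 quantity it uses. The model shape and traffic numbers are illustrative only.

```python
import math


def kv_cache_bytes_per_request(layers: int, kv_heads: int, head_dim: int,
                               bytes_per_elem: int, context_length: int) -> int:
    # The leading 2 accounts for storing both keys and values in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_length


def mm1_queue_wait_s(arrival_rps: float, service_rps: float) -> float:
    """Mean time spent waiting in an M/M/1 queue, in seconds.

    Assumption: the mean wait rho / (mu - lambda) is used as the queuing
    term. Returns infinity when arrivals meet or exceed capacity, because
    the queue then grows without bound.
    """
    if arrival_rps >= service_rps:
        return math.inf
    rho = arrival_rps / service_rps
    return rho / (service_rps - arrival_rps)


def p95_latency_ms(p95_ttft_ms: float, output_tokens: int, ms_per_token: float,
                   arrival_rps: float, service_rps: float) -> float:
    # p95 TTFT + decode time + queuing delay, following the formula above.
    queue_ms = 1000.0 * mm1_queue_wait_s(arrival_rps, service_rps)
    return p95_ttft_ms + output_tokens * ms_per_token + queue_ms


# Illustrative shape: 32 layers, 8 KV heads, head_dim 128, FP16, 4096 tokens.
kv_bytes = kv_cache_bytes_per_request(32, 8, 128, 2, 4096)
print(f"KV cache per request: {kv_bytes / 2**20:.0f} MiB")

# Illustrative traffic: 5 req/s arriving at a replica that serves 7 req/s.
print(f"Estimated p95 latency: {p95_latency_ms(200, 100, 20, 5, 7):.0f} ms")
```

When the arrival rate reaches or exceeds capacity, as in the 30 req/s versus 7.0 req/s case, the queuing term is unbounded and the function returns infinity, so no latency target can be met without adding capacity.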

Already deployed? See what's actually happening.

This calculator produces estimates from your inputs. piqc scans your running cluster and reports the actual tok/sec, memory utilization, and configuration gaps, with no agents and no write access required.

Get more from the cluster you already have.

Start for Free