How to Configure vLLM for Production

vLLM configuration is normally done through trial and error. Wrong max_num_seqs, misconfigured KV cache, or a bad speculative decoding decision can silently destroy throughput and latency. Here's how to get it right before you touch a cluster.
Engineers pick a starting point, deploy, watch latency, adjust, and redeploy. It works eventually, but it is slow, and a misconfigured deployment can silently destroy throughput and inflate GPU spend for weeks before anyone notices.
This guide covers the four decisions that matter most: max_num_seqs, KV cache allocation, speculative decoding, and context length. Get these right and you'll land close to your performance targets on the first deployment.
---
The Configuration Problem
vLLM exposes dozens of parameters. Most of them don't matter much. Four of them interact in ways that can make or break a deployment:
- max_num_seqs: how many requests run concurrently
- gpu_memory_utilization: the fraction of GPU memory vLLM may use, which determines how much is left for the KV cache
- max_model_len: the maximum context window per request
- speculative decoding: whether to use a draft model to accelerate generation
These four parameters are deeply interdependent. Setting max_num_seqs too high without enough KV cache can cause OOM errors or force sequences to be preempted under load. Setting gpu_memory_utilization too high leaves no headroom for memory outside vLLM's budget, such as other processes sharing the GPU. Getting speculative decoding wrong can increase latency instead of reducing it.
The right values depend on your model, GPU tier, average request length, and target throughput — not on defaults.
---
max_num_seqs: The Concurrency Setting
max_num_seqs caps how many sequences vLLM processes in a single batch. It is the most important parameter for throughput.
Setting it too low: GPU sits underutilized between requests. Throughput is limited even when the GPU has capacity.
Setting it too high: Each concurrent sequence requires KV cache space. Too many sequences exhaust the cache, triggering OOM errors or forcing preemption (a sequence's cache is evicted and rebuilt later), which hurts latency more than the extra concurrency helps throughput.
How to set it:
The right value is determined by how much VRAM is available after model weights are loaded, divided by the KV cache cost per sequence at your typical context length.

    vllm serve <model> --max-num-seqs 32

A rough starting point by model size and GPU:
| Model | GPU | Starting max_num_seqs |
|---|---|---|
| 7B | A10G (24GB) | 32–64 |
| 7B | A100-40GB | 64–128 |
| 13B | A100-40GB | 16–32 |
| 70B | 8xA100-80GB | 8–16 |
| 70B | 8xH100-80GB | 16–32 |
These are starting points. Use the vLLM Configuration Calculator to get a recommendation based on your specific model, GPU, and traffic profile.
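As a rough illustration of the sizing rule above, the sketch below divides the memory left after weights by the per-sequence KV cache cost. The layer count, KV head count, head dimension, weight size, and overhead figure are assumptions for a generic 7B-class model with grouped-query attention, not measured values; substitute the numbers from your own model's config.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor of 2 covers the separate key and value tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_max_num_seqs(total_vram_gib: float, gpu_memory_utilization: float,
                          weights_gib: float, overhead_gib: float,
                          per_token_bytes: int, typical_context_tokens: int) -> int:
    """Number of sequences that fit in the KV cache at a typical context length."""
    budget = total_vram_gib * GIB * gpu_memory_utilization
    kv_pool = budget - (weights_gib + overhead_gib) * GIB
    if kv_pool <= 0:
        return 0  # weights and overhead already exceed the memory budget
    per_sequence = per_token_bytes * typical_context_tokens
    return int(kv_pool // per_sequence)


# Assumed 7B-class model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
estimate = estimate_max_num_seqs(
    total_vram_gib=24, gpu_memory_utilization=0.9,
    weights_gib=13, overhead_gib=1.5,  # assumed weight and activation footprint
    per_token_bytes=per_token, typical_context_tokens=1024,
)
print(per_token, estimate)  # 131072 56
```

Treat the result as an upper bound to test against, since real traffic mixes short and long requests and the estimate ignores block-level rounding.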
---
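KV Cache Allocation: gpu_memory_utilization
gpu_memory_utilization sets the fraction of each GPU's memory that the vLLM instance may use for model weights, activations, and the KV cache combined. Whatever remains of that budget after weights and peak activation memory are accounted for becomes the KV cache. The default is 0.9 (90%).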
Too high: Risk of OOM errors under peak load when cache demand spikes.
Too low: Unused VRAM that could be serving requests. Artificially limits concurrency.
How it interacts with max_num_seqs:
Each concurrent sequence holds a slice of KV cache proportional to its context length. If you increase max_num_seqs without increasing available KV cache, sequences compete for cache space and evict each other, turning a throughput optimization into a latency problem.

    vllm serve <model> --gpu-memory-utilization 0.85

A more conservative value (0.80–0.85) is safer for production workloads with variable context lengths. The 0.90 default leaves little headroom for spikes.
Enable prefix caching if your requests share a system prompt:

    vllm serve <model> --enable-prefix-caching

For workloads with consistent system prompts, prefix caching can reduce KV cache consumption by 30–50%, effectively giving you the headroom of a more conservative utilization setting without sacrificing throughput.
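To see where a saving of that size could come from, here is a back-of-the-envelope sketch. It assumes every concurrent sequence shares one identical prefix that is stored once, and it ignores block granularity and cache eviction, so it is an idealized upper bound and not a model of vLLM's actual allocator. The function name and the token counts are illustrative.

```python
def prefix_cache_savings(num_seqs: int, shared_prefix_tokens: int,
                         unique_tokens_per_seq: int) -> float:
    """Fraction of KV cache tokens saved when a shared prefix is stored once."""
    if num_seqs <= 0:
        raise ValueError("num_seqs must be positive")
    # Without caching, every sequence holds its own copy of the prefix.
    without_cache = num_seqs * (shared_prefix_tokens + unique_tokens_per_seq)
    # With caching, the prefix is held once and only the unique tokens repeat.
    with_cache = shared_prefix_tokens + num_seqs * unique_tokens_per_seq
    if without_cache == 0:
        return 0.0
    return 1 - with_cache / without_cache


# A system prompt as long as the rest of the request saves almost half.
print(round(prefix_cache_savings(32, 1000, 1000), 3))  # 0.484
# A short system prompt on long requests saves much less.
print(round(prefix_cache_savings(32, 200, 1800), 3))  # 0.097
```

The saving depends almost entirely on how large the shared prefix is relative to the rest of each request, so measure your own prompt mix before counting on it.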
---
max_model_len: Match It to Your Workload
max_model_len sets the maximum context length per request, counting both prompt and generated tokens. A common mistake is setting this to the model's theoretical maximum (128K for some models) when the actual workload never uses more than 4K tokens.
vLLM allocates KV cache in blocks as sequences grow, so an oversized max_model_len does not reserve memory up front. It does allow a single long request to claim a large share of the cache, and vLLM must be able to fit at least one sequence of max_model_len tokens in the cache to start at all. Oversizing it gives up VRAM that could serve more concurrent requests.

    vllm serve <model> --max-model-len 8192

How to set it: Look at your p99 request length in production. Set max_model_len to that value plus a reasonable buffer, not the model's theoretical maximum.
If your p99 request is 3,000 tokens, a max_model_len of 8,192 is appropriate. 128K is waste.
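One way to turn that rule into a number is sketched below. The nearest-rank percentile, the 2x headroom factor, the rounding up to a power of two, and the 131072-token model ceiling are all assumptions chosen to reproduce the 3,000 to 8,192 example above; adjust them to your own policy.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_max_model_len(request_lengths: list[int], pct: int = 99,
                          headroom: float = 2.0, model_max: int = 131072) -> int:
    """Suggest max_model_len from observed total request lengths in tokens."""
    target = math.ceil(percentile(request_lengths, pct) * headroom)
    # Round up to the next power of two, capped at the model's own limit.
    suggestion = 1 << (target - 1).bit_length()
    return min(suggestion, model_max)


# Hypothetical traffic sample where the p99 request is 3,000 tokens.
observed = [800] * 95 + [3000] * 5
print(percentile(observed, 99), suggest_max_model_len(observed))  # 3000 8192
```

Feed it total lengths (prompt plus completion), since max_model_len bounds both together.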
---
Speculative Decoding: When It Helps and When It Doesn't
Speculative decoding uses a small draft model to predict several tokens ahead, then verifies them with the main model in parallel. When the acceptance rate is high, it significantly increases token throughput.
It helps when:
- Your workload produces predictable output — code generation, structured responses, templated text
- You have a good draft model for your base model (e.g., a smaller model from the same family)
- Batch sizes are small to moderate
It hurts when:
- Output is highly variable or creative (open-ended generation, chat)
- Acceptance rate is low — rejected tokens are wasted compute
- Batch sizes are large — overhead of draft model adds latency
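
    vllm serve <model> \
      --speculative-model <draft-model> \
      --num-speculative-tokens 5

The right number of speculative tokens depends on your acceptance rate. Start at 3–5 and measure actual throughput improvement versus baseline before committing. The flag names for speculative decoding have changed between vLLM releases, and newer releases take these settings through a single --speculative-config option, so check the CLI reference for the version you run.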
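The dependence on acceptance rate can be made concrete with a commonly used simplified model. It assumes each drafted token is accepted independently with the same probability, that one target-model verification pass costs 1 unit, and that each draft token costs a fixed fraction of that. It ignores batching effects, so it says nothing about the large-batch case above. The cost ratio of 0.1 is an assumed value.

```python
def expected_tokens_per_step(acceptance_rate: float,
                             num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(num_speculative_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - acceptance_rate ** (num_speculative_tokens + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, num_speculative_tokens: int,
                      draft_cost_ratio: float) -> float:
    """Tokens per unit cost relative to plain decoding (1 token per unit)."""
    step_cost = num_speculative_tokens * draft_cost_ratio + 1
    return expected_tokens_per_step(acceptance_rate, num_speculative_tokens) / step_cost


print(round(estimated_speedup(0.8, 5, 0.1), 2))  # 2.46
print(round(estimated_speedup(0.3, 5, 0.1), 2))  # 0.95
```

Under these assumptions a low acceptance rate lands below 1.0, meaning the draft work costs more than it saves, which is why measuring against a baseline matters.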
---
Predicting p95 Latency Before You Deploy
The interaction between max_num_seqs, KV cache, and request concurrency makes latency difficult to predict without running load tests. But there are useful rules of thumb:
- Time to First Token (TTFT) is dominated by the prefill phase — long input prompts hurt TTFT regardless of other settings
- Time Per Output Token (TPOT) is dominated by concurrency — more concurrent sequences means longer waits between tokens
- p95 latency is where misconfiguration shows up first — the tail requests are the ones getting their cache evicted or waiting in queue
A deployment that looks healthy at p50 can be broken at p95. Always test at realistic concurrency, not a single request at a time.
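The metrics above are straightforward to compute from load-test timestamps. The sketch below derives TTFT and TPOT for one request and then summarizes a batch of end-to-end latencies with a nearest-rank percentile. The latency values are synthetic and only illustrate how a healthy median can hide a broken tail.

```python
import math


def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (time to first token, mean time per output token) in seconds."""
    if not token_times:
        raise ValueError("token_times must not be empty")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic load test: 90 fast requests and 10 that waited in the queue.
latencies = [0.2] * 90 + [3.0] * 10
print(percentile(latencies, 50), percentile(latencies, 95))  # 0.2 3.0
```

Collect the timestamps while sending requests at realistic concurrency; numbers gathered one request at a time will understate both TPOT and the tail.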
---
Putting It Together
The parameters are interdependent. A configuration change that improves one metric can degrade another:
| Change | Effect on throughput | Effect on latency |
|---|---|---|
| Increase max_num_seqs | ↑ | ↑ (if KV cache is insufficient) |
| Increase gpu_memory_utilization | ↑ (more cache) | ↓ (less eviction) |
| Decrease max_model_len | ↑ (more VRAM for concurrency) | ↓ (less fragmentation) |
| Enable speculative decoding | ↑ (if acceptance rate > 70%) | ↑ (if acceptance rate < 50%) |
Rather than tuning these by hand, use the vLLM Configuration Calculator — plug in your model, GPU, and traffic profile and get back a recommended configuration including whether speculative decoding is likely to help your workload.
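Once values are chosen, it can help to keep them in one place and generate the launch command from them. This is a minimal sketch that only assembles the flags shown earlier in this guide and performs basic range checks. The helper name and the model identifier are hypothetical, and the speculative decoding options are left out because their flag names vary by vLLM version.

```python
import shlex


def build_vllm_command(model: str, max_num_seqs: int,
                       gpu_memory_utilization: float, max_model_len: int,
                       enable_prefix_caching: bool = False) -> list[str]:
    """Assemble a `vllm serve` argument list from the core settings."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_num_seqs < 1 or max_model_len < 1:
        raise ValueError("max_num_seqs and max_model_len must be positive")
    args = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if enable_prefix_caching:
        args.append("--enable-prefix-caching")
    return args


# "my-org/my-model" is a placeholder model name.
command = build_vllm_command("my-org/my-model", 32, 0.85, 8192, True)
print(shlex.join(command))
```

Keeping the settings in code or version-controlled config makes it easier to see which of the four parameters changed between deployments.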
---
After the Initial Configuration
Getting the initial configuration right is the first step. Traffic patterns change, models get updated, and load shifts — which means a configuration that was optimal on day one may be suboptimal by month three.
Real-time observability into KV cache usage, queue depth, TTFT, and TPOT is what keeps a deployment healthy over time. The initial configuration gets you close. Continuous monitoring keeps you there.
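As a small illustration of that kind of monitoring, the sketch below parses Prometheus-style metrics text and flags KV cache pressure and queue buildup. The metric names in the sample are assumptions: vLLM's metric names have changed across releases, so check what your server's metrics endpoint actually exposes. The thresholds are arbitrary, and the parser assumes no trailing timestamps on sample lines.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text lines into {metric_name: value}, dropping labels."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


def check_pressure(metrics: dict[str, float], cache_threshold: float = 0.9,
                   queue_threshold: int = 0) -> list[str]:
    """Return warnings for high cache usage or queued requests."""
    warnings = []
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) > cache_threshold:
        warnings.append("KV cache usage above threshold")
    if metrics.get("vllm:num_requests_waiting", 0.0) > queue_threshold:
        warnings.append("requests are queuing")
    return warnings


# Sample text with assumed metric names and a placeholder model label.
sample = """
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_running{model_name="m"} 12
vllm:num_requests_waiting{model_name="m"} 7
vllm:gpu_cache_usage_perc{model_name="m"} 0.93
"""
print(check_pressure(parse_metrics(sample)))
```

A check like this is only a starting point; trends over time matter more than a single scrape.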
Paralleliq's scanner surfaces KV cache pressure, idle capacity, and tier misplacement across your inference fleet in real time — so configuration drift doesn't compound into GPU waste. [Try piqc →](https://github.com/paralleliq/piqc)