How to Configure vLLM for Production

By Sam Hosseini · June 3, 2026 · 8 min read

vLLM configuration is normally done through trial and error. Wrong max_num_seqs, misconfigured KV cache, or a bad speculative decoding decision can silently destroy throughput and latency. Here's how to get it right before you touch a cluster.

Engineers typically pick a starting point, deploy, watch latency, adjust, and redeploy. It works eventually, but it's slow, and a misconfigured deployment can quietly depress throughput and inflate GPU spend for weeks before anyone notices.

This guide covers the four decisions that matter most: max_num_seqs, KV cache allocation, speculative decoding, and context length. Get these right and you'll land close to your performance targets on the first deployment.

---

The Configuration Problem

vLLM exposes dozens of parameters. Most of them don't matter much. Four of them interact in ways that can make or break a deployment:

max_num_seqs: how many requests run concurrently
gpu_memory_utilization: the fraction of GPU memory vLLM may use in total; what remains after weights and activation memory becomes the KV cache
max_model_len: the maximum context window per request
speculative decoding: whether to use a draft model to accelerate generation

These four parameters are deeply interdependent. Setting max_num_seqs too high without enough KV cache will cause preemptions or OOM errors. Setting gpu_memory_utilization too high leaves no headroom for memory that vLLM does not account for (the CUDA context, other processes on the GPU), which also risks OOM errors. Getting speculative decoding wrong can increase latency instead of reducing it.

The right values depend on your model, GPU tier, average request length, and target throughput — not on defaults.

---

max_num_seqs: The Concurrency Setting

max_num_seqs caps how many sequences vLLM schedules in a single batch (one scheduler iteration). It is the most important parameter for throughput.

Setting it too low: GPU sits underutilized between requests. Throughput is limited even when the GPU has capacity.

Setting it too high: Each concurrent sequence requires KV cache space. With too many sequences the cache fills up, and vLLM either preempts running sequences (their cache is dropped and rebuilt later) or, in the worst case, runs out of memory. Both hurt latency more than the extra concurrency helps throughput.

How to set it:

The right value is determined by how much VRAM is available after model weights are loaded, divided by the KV cache cost per sequence at your typical context length.

vllm serve <model> --max-num-seqs 32
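The division described above can be sketched in a few lines of Python. Treat this as a back-of-the-envelope estimate rather than vLLM's own accounting: the model shape (32 layers, 32 KV heads, head size 128, fp16) is an assumed 7B-class architecture, and the weight and overhead figures are illustrative guesses that should be replaced with measured values.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_max_num_seqs(vram_gib: float, gpu_memory_utilization: float,
                          weights_gib: float, overhead_gib: float,
                          per_token_bytes: int,
                          typical_context_tokens: int) -> int:
    """Sequences that fit in the KV cache at the typical context length."""
    kv_gib = vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if kv_gib <= 0:
        return 0  # weights and overhead already exceed the budget
    per_seq_bytes = per_token_bytes * typical_context_tokens
    return int(kv_gib * 2**30 // per_seq_bytes)


# Assumed 7B-class shape in fp16 on a 24 GiB card (all values illustrative).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

print(estimate_max_num_seqs(
    vram_gib=24, gpu_memory_utilization=0.90,
    weights_gib=12.5, overhead_gib=1.5,
    per_token_bytes=per_token, typical_context_tokens=512,
))  # 30
```

Under these assumptions the estimate lands near the low end of the 7B on A10G row below. Models that use grouped-query attention have fewer KV heads, which lowers the per-token cost and raises the estimate.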

A rough starting point by model size and GPU:

Model	GPU	Starting max_num_seqs
7B	A10G (24GB)	32–64
7B	A100-40GB	64–128
13B	A100-40GB	16–32
70B	8xA100-80GB	8–16
70B	8xH100-80GB	16–32

These are starting points. Use the vLLM Configuration Calculator to get a recommendation based on your specific model, GPU, and traffic profile.

---

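KV Cache Allocation: gpu_memory_utilization

gpu_memory_utilization sets the fraction of each GPU's memory that the vLLM instance may use in total. Model weights and activation memory come out of that budget first, and whatever remains is preallocated as KV cache. The default is 0.9 (90%).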

Too high: Risk of OOM errors under peak load when cache demand spikes.

Too low: Unused VRAM that could be serving requests. Artificially limits concurrency.

How it interacts with max_num_seqs:

Each concurrent sequence holds a slice of KV cache proportional to its context length. If you increase max_num_seqs without increasing available KV cache, sequences compete for cache blocks and vLLM preempts some of them to make room, turning a throughput optimization into a latency problem.

vllm serve <model> --gpu-memory-utilization 0.85

A more conservative value (0.80–0.85) is safer for production workloads with variable context lengths. The 0.90 default leaves little headroom for spikes.
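To see what a more conservative setting costs, here is a rough sketch of how much KV cache each utilization value leaves. It assumes a 40 GiB card, 12.5 GiB of weights, 2 GiB of activation memory, and 0.5 MiB of KV cache per token; all four numbers are illustrative, not measured.

```python
def kv_cache_tokens(vram_gib: float, gpu_memory_utilization: float,
                    weights_gib: float, activation_gib: float,
                    kv_bytes_per_token: int) -> int:
    """Approximate number of tokens the KV cache can hold."""
    budget_gib = vram_gib * gpu_memory_utilization
    # Weights and activations are paid first; the remainder is KV cache.
    kv_gib = max(budget_gib - weights_gib - activation_gib, 0.0)
    return int(kv_gib * 2**30 // kv_bytes_per_token)


for util in (0.80, 0.85, 0.90):
    tokens = kv_cache_tokens(
        vram_gib=40, gpu_memory_utilization=util,
        weights_gib=12.5, activation_gib=2.0,
        kv_bytes_per_token=524288,
    )
    print(f"gpu_memory_utilization={util:.2f}: about {tokens} tokens of KV cache")
```

On a 40 GiB card each 0.05 step is 2 GiB, which is roughly 4,096 tokens at 0.5 MiB per token. That is the capacity you trade for headroom.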

Enable prefix caching if your requests share a system prompt:

vllm serve <model> --enable-prefix-caching

For workloads with consistent system prompts, prefix caching can reduce KV cache consumption by 30–50% — effectively giving you the headroom of a more conservative utilization setting without sacrificing throughput.
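A quick way to sanity-check that range for your own traffic is to count the tokens that need cache space with and without sharing. The sketch below is a simplified model, not vLLM's implementation: it assumes every request starts with the same system prompt, that only whole cache blocks can be shared, and a block size of 16 tokens. The prompt and request sizes are made up.

```python
def kv_tokens_needed(num_seqs: int, shared_prefix_tokens: int,
                     unique_tokens: int, prefix_caching: bool,
                     block_size: int = 16) -> int:
    """Tokens of KV cache needed for num_seqs concurrent sequences."""
    if not prefix_caching:
        return num_seqs * (shared_prefix_tokens + unique_tokens)
    # Only whole blocks of the prefix are shared; the remainder is per sequence.
    shared = (shared_prefix_tokens // block_size) * block_size
    per_seq = (shared_prefix_tokens - shared) + unique_tokens
    return shared + num_seqs * per_seq


# Hypothetical workload: 32 concurrent requests, a 1,000-token system prompt,
# and 1,500 tokens of request-specific prompt plus output.
without = kv_tokens_needed(32, 1000, 1500, prefix_caching=False)
with_cache = kv_tokens_needed(32, 1000, 1500, prefix_caching=True)
print(without, with_cache)                        # 80000 49248
print(f"{1 - with_cache / without:.0%} saved")    # 38% saved
```

The saving grows with the share of each request taken up by the common prefix, so short system prompts in front of long conversations benefit much less.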

---

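max_model_len: Match It to Your Workload

max_model_len sets the maximum context window (prompt plus generated tokens) per request. A common mistake is setting this to the model's theoretical maximum (128K for some models) when the actual workload never uses more than 4K tokens.

vLLM allocates KV cache in blocks as a sequence grows, so a large max_model_len does not reserve memory up front. It does allow any single request to grow that long, and the cache has to be able to hold at least one full-length sequence. Oversizing it puts VRAM that could serve more concurrent requests at the mercy of a few long ones.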

vllm serve <model> --max-model-len 8192

How to set it: Look at your p99 request length in production. Set max_model_len to that value plus a reasonable buffer, not the model's theoretical maximum.

If your p99 request is 3,000 tokens, a max_model_len of 8,192 is appropriate. Setting it to 128K is wasteful.
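One way to turn that rule into a number is sketched below. The 1.5x headroom factor and the rounding up to a power of two are assumptions chosen for illustration, not vLLM requirements, and the request lengths are synthetic.

```python
def p99_nearest_rank(lengths: list[int]) -> int:
    """99th percentile by the nearest-rank method."""
    if not lengths:
        raise ValueError("need at least one request length")
    ordered = sorted(lengths)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def suggest_max_model_len(lengths: list[int], headroom: float = 1.5) -> int:
    """p99 length plus headroom, rounded up to a power of two (min 1024)."""
    target = int(p99_nearest_rank(lengths) * headroom)
    size = 1024
    while size < target:
        size *= 2
    return size


# Synthetic sample: total tokens (prompt + output) per request.
request_lengths = [1000] * 98 + [3000, 3500]
print(p99_nearest_rank(request_lengths))       # 3000
print(suggest_max_model_len(request_lengths))  # 8192
```

Measure prompt and output tokens together, since max_model_len bounds their sum. Requests that exceed the limit are rejected, so keep an eye on how often that happens after tightening it.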

---

Speculative Decoding: When It Helps and When It Doesn't

Speculative decoding uses a small draft model to predict several tokens ahead, then verifies them with the main model in parallel. When the acceptance rate is high, it significantly increases token throughput.

It helps when:

Your workload produces predictable output — code generation, structured responses, templated text
You have a good draft model for your base model (e.g., a smaller model from the same family)
Batch sizes are small to moderate

It hurts when:

Output is highly variable or creative (open-ended generation, chat)
Acceptance rate is low — rejected tokens are wasted compute
Batch sizes are large — overhead of draft model adds latency

vllm serve <model> \
  --speculative-model <draft-model> \
  --num-speculative-tokens 5

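The right number of speculative tokens depends on your acceptance rate. Start at 3–5 and measure actual throughput improvement versus baseline before committing. Flag names have changed across vLLM releases (newer versions take a JSON --speculative-config instead of --speculative-model), so check vllm serve --help for the version you run.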
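Before running that measurement, a simple model can tell you whether a given acceptance rate is even in the right range. The sketch below uses the standard simplification that each draft token is accepted independently with the same probability. `draft_cost_ratio` (the cost of one draft-model step relative to one main-model step) is a hypothetical input you would have to measure, and the model ignores batching effects entirely.

```python
def expected_tokens_per_step(acceptance_rate: float,
                             num_speculative_tokens: int) -> float:
    """Expected tokens produced per main-model verification step."""
    a, k = acceptance_rate, num_speculative_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if a == 1.0:
        return k + 1.0  # every draft token accepted, plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)


def estimated_speedup(acceptance_rate: float, num_speculative_tokens: int,
                      draft_cost_ratio: float) -> float:
    """Tokens per step divided by the relative cost of one step."""
    tokens = expected_tokens_per_step(acceptance_rate, num_speculative_tokens)
    step_cost = 1 + num_speculative_tokens * draft_cost_ratio
    return tokens / step_cost


for rate in (0.4, 0.6, 0.8):
    speedup = estimated_speedup(rate, num_speculative_tokens=5,
                                draft_cost_ratio=0.2)
    print(f"acceptance {rate:.0%}: estimated speedup {speedup:.2f}x")
```

With these inputs a 40% acceptance rate comes out below 1.0x, which is the "rejected tokens are wasted compute" case: the draft model's cost outweighs the tokens it contributes.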

---

Predicting p95 Latency Before You Deploy

The interaction between max_num_seqs, KV cache, and request concurrency makes latency difficult to predict without running load tests. But there are useful rules of thumb:

Time to First Token (TTFT) is dominated by the prefill phase — long input prompts hurt TTFT regardless of other settings
Time Per Output Token (TPOT) is dominated by concurrency — more concurrent sequences means longer waits between tokens
p95 latency is where misconfiguration shows up first — the tail requests are the ones getting their cache evicted or waiting in queue

A deployment that looks healthy at p50 can be broken at p95. Always test at realistic concurrency, not a single request at a time.
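The gap between p50 and p95 is easy to see with a little arithmetic. This sketch approximates end-to-end latency as TTFT plus TPOT for each remaining output token, then compares percentiles over a made-up load test in which 2 of 20 requests were queued and slowed down. The numbers are invented for illustration.

```python
def request_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: first token, then one TPOT per remaining token."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical load test: 18 healthy requests, 2 that waited in the queue
# and then decoded slowly under cache pressure.
latencies = [request_latency(0.2, 0.03, 200) for _ in range(18)]
latencies += [request_latency(2.5, 0.08, 200) for _ in range(2)]

print(f"p50 = {percentile(latencies, 50):.2f}s")
print(f"p95 = {percentile(latencies, 95):.2f}s")
```

Here the median looks fine while p95 is roughly three times higher, which is the pattern to look for in your own load-test output.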

---

Putting It Together

The parameters are interdependent. A configuration change that improves one metric can degrade another:

Change	Effect on throughput	Effect on latency
Increase max_num_seqs	↑	↑ (if KV cache is insufficient)
Increase gpu_memory_utilization	↑ (more cache)	↓ (less eviction)
Decrease max_model_len	↑ (more VRAM for concurrency)	↓ (less fragmentation)
Enable speculative decoding	↑ (if acceptance rate > 70%)	↑ (if acceptance rate < 50%)
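Because these settings move together, it helps to keep them in one place and generate the launch command from it. The helper below is a small sketch that uses only the flags shown earlier in this guide; the function name and the model identifier are placeholders.

```python
import shlex


def build_vllm_command(model: str, max_num_seqs: int,
                       gpu_memory_utilization: float, max_model_len: int,
                       enable_prefix_caching: bool = False) -> str:
    """Assemble a `vllm serve` command line from the settings in this guide."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_num_seqs < 1 or max_model_len < 1:
        raise ValueError("max_num_seqs and max_model_len must be positive")
    args = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if enable_prefix_caching:
        args.append("--enable-prefix-caching")
    return shlex.join(args)  # quotes any argument that needs it


# "my-org/my-model" is a placeholder, not a real model.
print(build_vllm_command("my-org/my-model", max_num_seqs=32,
                         gpu_memory_utilization=0.85, max_model_len=8192,
                         enable_prefix_caching=True))
```

Keeping the values in code or a config file also gives you a diff to review when someone changes one setting without revisiting the others.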

Rather than tuning these by hand, use the vLLM Configuration Calculator — plug in your model, GPU, and traffic profile and get back a recommended configuration including whether speculative decoding is likely to help your workload.

---

After the Initial Configuration

Getting the initial configuration right is the first step. Traffic patterns change, models get updated, and load shifts — which means a configuration that was optimal on day one may be suboptimal by month three.

Real-time observability into KV cache usage, queue depth, TTFT, and TPOT is what keeps a deployment healthy over time. The initial configuration gets you close. Continuous monitoring keeps you there.
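As a minimal sketch of what that monitoring can look like, the snippet below parses Prometheus-style text, the format vLLM's metrics endpoint uses, and flags cache pressure and queueing. The metric names and the sample values are assumptions: names have varied between vLLM versions, so check the output of your own server before relying on them. The parser is deliberately simplified and expects lines without timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set
        values[name] = float(value)
    return values


def find_pressure(metrics: dict[str, float], cache_limit: float = 0.9,
                  queue_limit: float = 0) -> list[str]:
    """Return warnings for KV cache pressure and queued requests."""
    warnings = []
    # Assumed metric names; verify them against your vLLM version.
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    if cache > cache_limit:
        warnings.append(f"KV cache usage at {cache:.0%}")
    if waiting > queue_limit:
        warnings.append(f"{waiting:.0f} requests waiting in queue")
    return warnings


# Hypothetical scrape, hard-coded so the example runs without a server.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:num_requests_running{model_name="demo"} 64.0
"""

for warning in find_pressure(parse_metrics(SAMPLE)):
    print(warning)
```

Sustained cache usage near 100% together with a non-empty queue is the signal that max_num_seqs, max_model_len, or the traffic itself has drifted away from what the initial configuration assumed.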

Paralleliq's scanner surfaces KV cache pressure, idle capacity, and tier misplacement across your inference fleet in real time — so configuration drift doesn't compound into GPU waste. [Try piqc →](https://github.com/paralleliq/piqc)
