The LLM Inference Autoscaling Stack: What Each Layer Solves — and the Gap None of Them Close

By Sam Hosseini · May 21, 2026 · 11 min read

KEDA, Thoras.ai, llm-d, NVIDIA Dynamo, KServe, Run:ai — each is real, each is useful. Here's what each layer of the inference autoscaling stack actually covers, and what the entire stack leaves unaddressed.

The Crowded Field

If you're running LLM inference on Kubernetes in 2026, you have more autoscaling options than you probably realize — and fewer than you actually need.

KEDA has become the default event-driven scaler for vLLM pods. Thoras.ai is adding ML-based predictive scaling. llm-d just joined the CNCF Sandbox with backing from Red Hat, Google, IBM, NVIDIA, and CoreWeave. NVIDIA Dynamo runs separate scaling loops for prefill and decode phases. KServe is evolving from model server to serving control plane. Run:ai — now part of NVIDIA — handles fractional GPUs and multi-model enterprise scaling.

Everyone is working on inference autoscaling. The question worth asking: what does each layer actually solve, and what does the entire stack leave unaddressed?

Layer 1: Event-Driven Scaling — KEDA

KEDA (Kubernetes Event-Driven Autoscaling) is the CNCF project most teams reach for first. It scales pods based on external metrics: queue depth, Kafka lag, Prometheus metrics, request counts. For LLM inference, it can ingest vLLM-native metrics like pending requests, KV cache utilization, and token generation rate.

KEDA is excellent at what it does. If you give it the right signals, it scales inference pods up and down appropriately. Microsoft's AKS team has built explicit integrations between KEDA, KAITO, and vLLM, so if you're on Azure, this path is increasingly well-paved. KServe + KEDA enables scale-to-zero, which removes idle GPU cost for low-traffic deployments in exchange for cold-start latency on the first request after an idle period.

What KEDA doesn't know: which model is running. KEDA sees a metric crossing a threshold and responds with a replica delta. It has no concept of whether the pods it's scaling are serving a 7B model that belongs on an A10G or a 70B model that requires an H100. It scales horizontally — more pods — and leaves every other question to you.

That's fine. KEDA was never designed to be model-aware. But it means "we use KEDA" is not the same as "we have inference autoscaling handled."
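To make the "metric in, replica delta out" behavior concrete, here is a minimal Python sketch of the proportional rule the Kubernetes Horizontal Pod Autoscaler applies to a per-pod average metric; KEDA feeds external metrics into that same HPA machinery. The function name, the pending-request numbers, and the replica bounds are illustrative assumptions, and the sketch ignores the HPA tolerance band and stabilization windows.

```python
import math

def desired_replicas(current_replicas: int, avg_metric_per_pod: float,
                     target_per_pod: float, min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """Proportional rule: ceil(current * observed / target), clamped to bounds."""
    raw = math.ceil(current_replicas * avg_metric_per_pod / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical numbers: 4 vLLM pods, each averaging 30 pending requests,
# against a target of 10 pending requests per pod.
replicas = desired_replicas(4, avg_metric_per_pod=30, target_per_pod=10)
print(replicas)
```

Nothing in the inputs identifies the model or the GPU tier. The rule returns the same answer whether the pods serve a 7B model or a 70B model.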

Layer 2: Predictive Scaling — Thoras.ai

Thoras.ai adds a machine learning layer on top of Kubernetes scaling. Instead of reacting to a metric crossing a threshold, Thoras forecasts demand based on historical CPU, memory, and traffic patterns and scales proactively — before the spike arrives. It deploys entirely inside your cluster via Helm and integrates with Prometheus. For teams with predictable traffic patterns — daily cycles, weekly peaks — this is a meaningful improvement over reactive scaling.
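As an illustration of scaling ahead of a recurring pattern, the sketch below uses a seasonal-naive forecast: it predicts the next hour's load as the mean of the same hour in previous weeks, then sizes replicas with some headroom. This is a deliberately simple stand-in and not Thoras's model; the function names, the per-replica capacity, and the sample data are all assumptions.

```python
import math
from statistics import mean

HOURS_PER_WEEK = 168

def seasonal_naive_forecast(history: list[float], weeks: int = 3) -> float:
    """Forecast the next hour as the mean of the same hour in prior weeks.

    `history` holds one sample per hour, oldest first, ending at the most
    recent completed hour.
    """
    samples = []
    for w in range(1, weeks + 1):
        idx = len(history) - w * HOURS_PER_WEEK
        if idx >= 0:
            samples.append(history[idx])
    if not samples:
        raise ValueError("need at least one full week of hourly history")
    return mean(samples)

def replicas_for(forecast_rps: float, rps_per_replica: float,
                 headroom: float = 1.2) -> int:
    """Replicas needed to serve the forecast with a safety margin."""
    return max(1, math.ceil(forecast_rps * headroom / rps_per_replica))

# Synthetic data: a 10 req/s baseline with a 100 req/s spike in the first
# hour of every week. The next hour to forecast is the start of week four.
history = [100.0 if h % HOURS_PER_WEEK == 0 else 10.0
           for h in range(3 * HOURS_PER_WEEK)]
forecast = seasonal_naive_forecast(history)
replicas = replicas_for(forecast, rps_per_replica=25.0)
print(forecast, replicas)
```

The forecast lets the scaler act before the spike arrives, but its inputs are still traffic and resource history. The model and its GPU tier never enter the calculation.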

The limitation is the same as KEDA's: Thoras forecasts resource consumption patterns. It doesn't know what model is running, what tier that model belongs on, or whether the workload is placed optimally. It learns that your GPU nodes spike on Tuesday mornings and scales ahead of that. It doesn't know that your Tuesday morning spike comes from a 7B model sitting on an H100, with 3x the memory bandwidth it actually needs, when an A10G would do.

Thoras is also fully autonomous — it scales without a human approval step. For most Kubernetes workloads, that's fine. For production inference clusters with compliance requirements or multi-stakeholder governance, that's a trade-off worth examining.

Layer 3: Serving-Layer Autoscaling — llm-d and NVIDIA Dynamo

This is where the stack gets genuinely sophisticated.

llm-d is a CNCF Sandbox project (joined March 2026) backed by Red Hat, Google Cloud, IBM Research, CoreWeave, NVIDIA, and a consortium of academic institutions. Its Workload Variant Autoscaler handles disaggregated inference: scaling prefill and decode phases independently, routing requests based on KV cache state, and optimizing for throughput at the serving layer. Disaggregated prefill/decode is one of the more significant architectural shifts in inference infrastructure — running prefill and decode on separate GPU pools with independent scaling targets is meaningfully different from treating them as a single homogeneous workload.

NVIDIA Dynamo runs separate scaling loops for prefill and decode phases, forecasts demand using time-series models, and targets latency SLAs (TTFT, inter-token latency) rather than just queue depth. It calculates replicas based on profiled per-GPU throughput curves — meaning it understands the performance characteristics of the model it's serving.

Both are powerful. Both are also solving a different problem: optimizing serving throughput within an already-provisioned fleet. They assume the right GPUs are already in place. Neither raises the question of whether the GPU tier is correctly matched to the model, or whether the fleet itself is well-configured before any scaling decision is made.
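The per-phase replica calculation described above can be sketched as follows, assuming you have already profiled how many tokens per second one GPU sustains for each phase while staying inside its latency target. The data class, field names, and throughput figures are hypothetical; this is not Dynamo's planner or llm-d's autoscaler, only the arithmetic the idea rests on.

```python
import math
from dataclasses import dataclass

@dataclass
class PhaseProfile:
    """Profiled capacity of one GPU for a phase while meeting its SLA."""
    name: str
    tokens_per_sec_per_gpu: float

def replicas_needed(demand_tokens_per_sec: float, profile: PhaseProfile) -> int:
    """GPUs required to serve the demand at the profiled per-GPU rate."""
    return max(1, math.ceil(demand_tokens_per_sec / profile.tokens_per_sec_per_gpu))

# Hypothetical workload: 20 requests/s, 1,500 prompt tokens in, 300 tokens out.
requests_per_sec = 20
prefill_demand = requests_per_sec * 1500   # prompt tokens/s to prefill
decode_demand = requests_per_sec * 300     # output tokens/s to decode

prefill = PhaseProfile("prefill", tokens_per_sec_per_gpu=8000)  # within TTFT target
decode = PhaseProfile("decode", tokens_per_sec_per_gpu=1200)    # within ITL target

plan = {p.name: replicas_needed(d, p)
        for p, d in ((prefill, prefill_demand), (decode, decode_demand))}
print(plan)
```

The same traffic yields different replica counts for each phase, which is why scaling them independently matters. The GPU type is baked into the profile as an input; the calculation never questions it.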

KServe sits slightly apart — it's a model serving platform that handles lifecycle management, canary deployments, traffic routing, and autoscaling in a unified interface. It's increasingly used as a serving control plane, and its KEDA integration enables scale-to-zero. Like the others, it is infrastructure-aware rather than model-economics-aware.

Layer 4: GPU Platform Scaling — Run:ai and CoreWeave

Run:ai (now part of NVIDIA) and CoreWeave address autoscaling at the GPU cloud level. Run:ai handles fractional GPU allocation, multi-tenant scheduling, and enterprise-grade scaling across large fleets. CoreWeave is a Kubernetes-native GPU cloud with orchestration and autoscaling built into the platform.

These are full-stack approaches: they own hardware, networking, and orchestration together. Within their ecosystems, they handle GPU sharing, job scheduling, and scaling reasonably well. For teams running outside those ecosystems, they're not a factor.

The Gap the Entire Stack Leaves

Every layer described above answers a version of the same question: how do we scale this workload to meet demand?

None of them asks a prior question: is this workload correctly placed to begin with?

That distinction creates a class of problems invisible to every layer of the autoscaling stack.

Horizontal-only decisions

Kubernetes autoscaling is inherently horizontal — more pods, more nodes. For LLM inference, the more fundamental question is sometimes vertical: this model needs a different GPU tier, not more replicas of the current one. A 7B model on an H100 that is autoscaling to meet demand is efficiently scaling a misconfigured deployment. The right fix is a tier move, not a replica count adjustment. No autoscaler in this stack makes that recommendation, because none of them carries model-to-tier knowledge.
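A rough sketch of the model-to-tier knowledge this paragraph refers to: estimate the memory a model needs and pick the smallest GPU that fits. The 24 GB (A10G) and 80 GB (H100) capacities are published specs; the 2 bytes per parameter (FP16/BF16), the flat 30% allowance for KV cache and runtime overhead, and the function names are simplifying assumptions.

```python
import math

GPU_MEMORY_GB = {"A10G": 24, "H100": 80}  # per-GPU memory, smallest first

def estimated_vram_gb(params_billion: float, bytes_per_param: int = 2,
                      overhead: float = 0.30) -> float:
    """Weights plus a flat allowance for KV cache and runtime overhead."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * (1 + overhead)

def smallest_fitting_tier(params_billion: float) -> tuple[str, int]:
    """Return (gpu_type, gpu_count): the smallest single GPU that fits,
    else how many of the largest GPU are needed to hold the model."""
    need = estimated_vram_gb(params_billion)
    for gpu, mem in GPU_MEMORY_GB.items():
        if need <= mem:
            return gpu, 1
    largest, mem = list(GPU_MEMORY_GB.items())[-1]
    return largest, math.ceil(need / mem)

print(smallest_fitting_tier(7))    # a 7B model
print(smallest_fitting_tier(70))   # a 70B model
```

Real deployments would round a multi-GPU count up to a tensor-parallel-friendly size and account for context length and batch size, but even this crude estimate captures a fact no replica-count autoscaler has.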

Tier misplacement amplified by scaling

When a model is on the wrong GPU tier (provisioned with more memory bandwidth than it needs, on hardware priced significantly above its requirements), autoscaling amplifies the cost. Every additional replica multiplies the misplacement cost. A team that has correctly wired up KEDA and watches their inference scale cleanly with demand may be efficiently scaling their way through a substantial amount of unnecessary GPU spend.
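The amplification is simple arithmetic: the per-replica price gap is multiplied by every replica the autoscaler adds. The hourly prices below are placeholders chosen for illustration, not quotes from any provider, and the function name is hypothetical. The sketch also assumes the same replica count on both tiers, which ignores throughput differences between GPUs.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_misplacement_cost(replicas: int, current_hourly: float,
                              right_sized_hourly: float) -> float:
    """Extra monthly spend from running each replica on an oversized tier."""
    return replicas * (current_hourly - right_sized_hourly) * HOURS_PER_MONTH

# Placeholder prices (assumptions): $4.00/h for the large tier,
# $1.00/h for the tier the model actually needs.
for replicas in (2, 8, 20):
    extra = monthly_misplacement_cost(replicas, 4.00, 1.00)
    print(f"{replicas:>2} replicas -> ${extra:,.0f}/month over right-sized cost")
```

The gap per replica is constant, so the waste grows linearly with exactly the number the autoscaler is designed to increase.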

CPU:GPU imbalance misdiagnosed as GPU pressure

In agentic workloads with tool calls and orchestration loops, CPU saturation can throttle GPU throughput. The symptom (low GPU throughput, queued requests) looks like GPU pressure. The autoscaler responds by adding GPU replicas. The real fix is CPU scaling. A model-unaware autoscaler sees the metric; it doesn't see the cause. Additional GPU replicas in this scenario do not resolve the throughput problem; they just cost more.
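One way to separate the symptom from the cause is to look at CPU and GPU utilization together before choosing a scaling action. The sketch below is a heuristic with made-up thresholds and hypothetical field names; real clusters would need thresholds tuned to their own workloads.

```python
from dataclasses import dataclass

@dataclass
class PodMetrics:
    cpu_util: float        # 0.0-1.0, fraction of the CPU limit in use
    gpu_util: float        # 0.0-1.0, GPU compute utilization
    queued_requests: int

def diagnose(m: PodMetrics, cpu_hot: float = 0.85, gpu_hot: float = 0.80,
             queue_min: int = 10) -> str:
    """Suggest which resource to scale; thresholds are illustrative."""
    if m.queued_requests < queue_min:
        return "no_action"
    if m.gpu_util >= gpu_hot:
        return "scale_gpu_replicas"
    if m.cpu_util >= cpu_hot:
        # Requests are queuing while the GPU sits underused and the CPU is
        # saturated: the GPU is likely starved by tool calls or orchestration.
        return "scale_cpu"
    return "investigate"

# Queue is backing up, GPU is mostly idle, CPU is pinned.
print(diagnose(PodMetrics(cpu_util=0.95, gpu_util=0.35, queued_requests=40)))
```

A scaler that only watches queue depth would treat this case and a genuinely GPU-bound case identically.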

Dark capacity

GPUs allocated to deployments that aren't receiving traffic look fine from an autoscaling perspective — they're provisioned, they're ready. An autoscaler doesn't surface them as waste because they're not generating a scaling event. They're just sitting there, costing money. KEDA's scale-to-zero partially addresses this, but only for workloads configured with the right idle-detection logic.
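Detecting dark capacity amounts to joining two facts an autoscaler never compares: GPUs allocated and requests served. A minimal sketch follows, with a hypothetical record shape, a 7-day window, and placeholder hourly prices that are assumptions rather than real provider rates.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    name: str
    gpus_allocated: int
    requests_last_7d: int
    gpu_hourly_price: float  # assumed price per GPU-hour

def dark_capacity(deployments: list[Deployment],
                  hours_per_month: int = 730) -> list[tuple[str, float]]:
    """Return (name, monthly cost) for deployments holding GPUs with no traffic."""
    return [
        (d.name, d.gpus_allocated * d.gpu_hourly_price * hours_per_month)
        for d in deployments
        if d.gpus_allocated > 0 and d.requests_last_7d == 0
    ]

# Hypothetical fleet inventory.
fleet = [
    Deployment("chat-prod", 8, 1_200_000, 4.00),
    Deployment("summarizer-staging", 2, 0, 4.00),
    Deployment("old-experiment", 1, 0, 1.00),
]
idle = dark_capacity(fleet)
print(idle)
```

Neither idle deployment would ever trigger a scaling event, yet both carry a monthly cost that only shows up once allocation and traffic are compared side by side.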

No economics

Every tool in this stack expresses findings as metrics: utilization percentages, replica counts, latency targets. None of them translates a misconfigured deployment into dollars per month of unnecessary spend. The conversion from technical metric to business impact, from "GPU at 5% utilization" to "this model is costing $X/month more than it should", is consistently absent at every layer.

No governance

Most of this stack is built to act autonomously. For teams with compliance requirements, change-management processes, or multi-stakeholder approval chains, autonomous scaling decisions against production GPU clusters are not always acceptable. A human-in-the-loop approval layer — with a full audit trail of who approved what, when, and what changed — is outside what any of these tools provides.

The Right Tool for the Right Layer

This is not an argument against KEDA, Thoras, llm-d, or Dynamo. They are real tools solving real problems at the serving layer. Teams building production inference infrastructure should understand them and use the ones that fit their stack.

The argument is that serving-layer autoscaling and control-plane-layer management are different problems at different levels:

  • Serving layer — KEDA, Thoras, llm-d, Dynamo, KServe: how many replicas, how fast, at what latency target — optimizing throughput within a provisioned fleet
  • Control plane layer — Paralleliq: cluster registration, fleet inventory, model-aware placement, cost quantification, remediation workflows, audit trail — operating the fleet itself with production-grade governance

A control plane for AI clusters does what none of the serving-layer tools do: it connects to every cluster in your fleet, ingests continuous facts about what is running (model identity, VRAM consumption, memory bandwidth, inference traffic), reasons over that data to detect problems and quantify their cost in dollars, delivers specific recommendations through a human-in-the-loop approval workflow, and records every action in a tamper-evident audit log. That is the management layer. The serving-layer tools live inside it, not above it.
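One common way to make an audit log tamper-evident is a hash chain, where each entry includes the hash of the previous one so that any later edit breaks verification. The sketch below illustrates that general technique with hypothetical field names and example actions; it is not a description of how Paralleliq implements its log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_entry(log: list[dict], actor: str, action: str,
                 approved_by: str, timestamp: str) -> dict:
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "approved_by": approved_by,
            "timestamp": timestamp, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry

def verify(log: list[dict]) -> bool:
    """Recompute every hash; return False if any entry or link was altered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True

log: list[dict] = []
append_entry(log, "recommender", "move model-7b from H100 to A10G",
             approved_by="alice", timestamp="2026-05-21T10:00:00Z")
append_entry(log, "recommender", "release GPUs held by old-experiment",
             approved_by="bob", timestamp="2026-05-21T11:30:00Z")

intact = verify(log)                      # chain verifies as written
log[0]["approved_by"] = "someone-else"    # simulate a retroactive edit
tampered_detected = not verify(log)       # verification now fails
print(intact, tampered_detected)
```

Each entry records who approved what and when, and changing any past approval invalidates the stored hash for that entry, so the edit cannot pass unnoticed.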

A team running llm-d for serving-layer throughput optimization can simultaneously run Paralleliq as the control plane for the fleet those workloads run on. These tools do not compete. They don't overlap. They answer different questions at different layers.

The mistake is assuming that having a sophisticated serving-layer autoscaler means the control plane problem is covered. The 20–40% of GPU spend that disappears in production inference fleets is not lost to poor serving-layer scaling decisions. It's lost to models on the wrong tiers, GPUs allocated to dead deployments, and scale actions made without visibility into what they actually cost.

---

_Paralleliq is a model-aware GPU control plane for AI inference fleets. Start with piqc — the open-source GPU waste scanner — or contact us to discuss the full control plane for your fleet._
