Orchestration, Serving, and Execution: The Three Layers of Model Deployment

Most teams don't struggle with AI because models are hard. They struggle because three different systems — execution, serving, orchestration — are asked to behave like one.
Published: Jan 2, 2026
---
As AI models move from experimentation to production, teams often discover that _deployment_ is where complexity explodes. It's not because models are mysterious. It's because three fundamentally different systems are involved, and they are often treated as one.
Those systems are:
- Execution
- Serving
- Orchestration
Understanding what each layer does — and what it does _not_ do — is essential to building reliable, cost-effective AI systems.
The core problem: one word, three meanings
When someone says "We deployed the model," they might mean:
- the model runs on a GPU
- the model responds to HTTP requests
- the model is scaled and monitored in Kubernetes
These are not the same thing. They correspond to three separate layers, each with different responsibilities, failure modes, and ownership.
1. Execution: how the model actually runs
Execution is the innermost layer. This is where:
- model weights are loaded
- GPU memory is allocated
- kernels are launched
- batching happens
- tokens are generated
Execution systems are:
- model-aware
- GPU-aware
- latency-critical
Examples
- vLLM
- TensorRT-LLM
- PyTorch inference code
- ONNX Runtime
If execution fails, no inference happens. This layer determines:
- throughput
- latency
- memory pressure
- GPU utilization
Execution is mandatory. Without it, there is no model.
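To make "memory pressure" concrete, here is a rough sketch of the arithmetic an execution engine has to respect when it sizes its KV cache. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 20 GiB budget are illustrative assumptions, not figures for any particular model or GPU, and real engines add their own overheads on top.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    The leading factor of 2 covers the key tensor and the value tensor.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(free_gpu_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left over after loading the weights."""
    return free_gpu_bytes // bytes_per_token


# Hypothetical model shape and memory budget, chosen only for illustration.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 20 * 1024**3  # 20 GiB assumed free for the cache

print(per_token)                             # 131072 bytes (128 KiB) per token
print(max_cached_tokens(budget, per_token))  # 163840 tokens across all sequences
```

That token budget is shared by every concurrent request, which is why batching, context length, and throughput are decided at this layer and not above it.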
2. Serving: how requests reach the model
Serving wraps execution with an interface. It answers questions like:
- How do clients send requests?
- How many requests are handled concurrently?
- What happens when a request times out?
- How do we check if the model is healthy?
Serving systems provide:
- HTTP or gRPC APIs
- request parsing
- concurrency control
- basic health checks
Examples
- vLLM HTTP server
- TGI
- Triton Inference Server
- Custom FastAPI / gRPC services
In modern LLM runtimes, serving and execution are often combined. vLLM, for example, is both an execution engine _and_ a server.
Without serving:
- batch or offline inference still works
- online inference does not
Serving is required for real-time production workloads, optional for offline jobs.
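As a minimal sketch of what the serving layer adds, the snippet below wraps an execution call with concurrency control, a timeout, and a health check. `ModelServer`, `fake_generate`, and the status codes are hypothetical names chosen for illustration; a real server would put an HTTP or gRPC framework in front of this logic.

```python
import asyncio


class ModelServer:
    """Hypothetical serving wrapper around an async execution callable."""

    def __init__(self, generate, max_concurrency: int, timeout_s: float):
        self._generate = generate            # stands in for the execution layer
        self._max_concurrency = max_concurrency
        self._timeout_s = timeout_s
        self._in_flight = 0

    async def handle(self, prompt: str) -> dict:
        # Concurrency control: reject instead of queueing without bound.
        if self._in_flight >= self._max_concurrency:
            return {"status": 429, "error": "too many requests"}
        self._in_flight += 1
        try:
            text = await asyncio.wait_for(self._generate(prompt), self._timeout_s)
            return {"status": 200, "text": text}
        except asyncio.TimeoutError:
            return {"status": 504, "error": "generation timed out"}
        finally:
            self._in_flight -= 1

    def health(self) -> dict:
        # Basic health check: the process is up and can report its load.
        return {"status": 200, "in_flight": self._in_flight}


async def fake_generate(prompt: str) -> str:
    await asyncio.sleep(0.01)                # pretend to run the model
    return prompt.upper()


async def demo() -> list:
    server = ModelServer(fake_generate, max_concurrency=2, timeout_s=1.0)
    return await asyncio.gather(*(server.handle(p) for p in ["a", "b", "c"]))


results = asyncio.run(demo())
print([r["status"] for r in results])        # [200, 200, 429]
```

Nothing here knows how tokens are produced. It only decides who gets in, how long they may wait, and what the outside world sees.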
3. Orchestration: how the system stays alive
Orchestration is the outermost layer. It manages:
- where processes run
- how many replicas exist
- restarts on failure
- scaling decisions
- lifecycle events
Orchestration systems are:
- model-agnostic
- resource-centric
- slow-moving (seconds to minutes)
Examples
- Kubernetes
- Nomad
- Ray
- Slurm (for batch workloads)
Orchestration does _not_ understand:
- tokens
- batching
- GPU memory layouts
- model behavior
It understands:
- pods
- CPUs
- memory
- GPUs as allocatable resources
Without orchestration:
- you can still run a model
- but scaling, resilience, and operations are manual
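The core of what an orchestrator does can be sketched as a reconcile loop: compare the desired state with the observed state and close the gap. `Replica` and `reconcile` below are hypothetical names, and this is a toy version of the idea, not how Kubernetes or any other scheduler is implemented.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True


def reconcile(replicas: list[Replica], desired: int) -> list[Replica]:
    """One pass of a toy control loop: drop failed replicas, then converge on `desired`."""
    alive = [r for r in replicas if r.healthy]
    alive = alive[:desired]                      # scale down if there are too many
    next_id = len(replicas)
    while len(alive) < desired:                  # replace failures or scale up
        alive.append(Replica(name=f"replica-{next_id}"))
        next_id += 1
    return alive


observed = [Replica("replica-0"), Replica("replica-1", healthy=False), Replica("replica-2")]
print([r.name for r in reconcile(observed, desired=3)])
# ['replica-0', 'replica-2', 'replica-3']
```

The loop only sees a health flag and a count. Whether a "healthy" replica is actually generating tokens efficiently is invisible at this level.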
Are all three required?
Not always — but almost always in production. Let's be precise.
Local experimentation
- Execution: yes
- Serving: no
- Orchestration: no
This is not deployment.
Offline or batch inference
- Execution: yes
- Serving: no
- Orchestration: optional
Examples:
- embedding generation jobs
- nightly batch runs
Single-node online service
- Execution: yes
- Serving: yes
- Orchestration: no
Example:
- one VM
- one container
- manual restarts
Works, but fragile.
Production online inference (the common case)
- Execution: yes
- Serving: yes
- Orchestration: yes
This is where:
- GPUs are expensive
- traffic fluctuates
- failures must be handled automatically
This is also where most complexity appears.
Why things break in production
The pain doesn't come from having three layers. It comes from pretending they are one.
Common failure modes:
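- Orchestration scales on CPU utilization, not on token throughput or queue depth
- Serving timeouts are set without regard to context length or maximum generation length
- Execution settings are copied from blog posts rather than tuned for the actual hardware and workload
- GPU underutilization is hidden behind pods that report healthy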
Each layer is behaving "correctly" in isolation — but incorrectly as a system.
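The timeout mismatch is easy to see with a little arithmetic. The sketch below estimates the worst-case generation time from the time to first token, the maximum number of new tokens, and the time per output token, then compares it with the serving timeout. All numbers and function names are assumptions picked for illustration.

```python
def worst_case_generation_s(ttft_s: float, max_new_tokens: int,
                            time_per_output_token_s: float) -> float:
    """Upper bound on one request: time to first token plus every decoded token."""
    return ttft_s + max_new_tokens * time_per_output_token_s


def timeout_covers_generation(timeout_s: float, ttft_s: float, max_new_tokens: int,
                              time_per_output_token_s: float) -> bool:
    """True if the serving timeout is at least the worst-case generation time."""
    return timeout_s >= worst_case_generation_s(
        ttft_s, max_new_tokens, time_per_output_token_s
    )


# Hypothetical figures: 0.5 s to first token, 2048 new tokens at 30 ms each.
worst = worst_case_generation_s(0.5, 2048, 0.03)
print(round(worst, 2))                                    # 61.94 seconds
print(timeout_covers_generation(30.0, 0.5, 2048, 0.03))   # False
```

A 30-second timeout looks like a sensible serving default, yet under these assumptions it would cut off every long generation the execution layer is configured to allow.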
Why Kubernetes alone is not enough
Kubernetes is excellent at:
- keeping processes alive
- allocating resources
- restarting failures
It does not understand:
- batch collapse
- KV cache pressure
- token-level latency
- model-specific constraints
That semantic gap is why LLM autoscaling is hard and why scaling with a CPU-based Horizontal Pod Autoscaler (HPA) often fails.
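A worked example shows the gap. The Kubernetes documentation describes the HPA's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that formula to two signals for the same overloaded service; the tolerance band and stabilization windows of the real controller are omitted, and the queue-depth metric and all numbers are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Core HPA ratio formula, without tolerance or stabilization handling."""
    return math.ceil(current_replicas * current_metric / target_metric)


replicas = 4

# GPU-bound inference can leave the CPU mostly idle: 35% used against a 70% target.
by_cpu = hpa_desired_replicas(replicas, current_metric=35, target_metric=70)

# Meanwhile requests pile up: 24 waiting per replica against a target of 8.
by_queue = hpa_desired_replicas(replicas, current_metric=24, target_metric=8)

print(by_cpu)    # 2  -> the CPU signal asks to scale down
print(by_queue)  # 12 -> the queue signal asks to triple capacity
```

Both calculations are correct for the metric they were given. Only one of them reflects what the model is actually experiencing.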
The missing abstraction: intent
What's missing in most deployments is a way to express:
- what the model expects
- what "good performance" means
- what constraints must not be violated
Those concepts don't belong exclusively to:
- execution
- serving
- or orchestration
They sit above all three.
Without that layer, teams rely on:
- tribal knowledge
- fragile defaults
- reactive tuning
Why this distinction matters
Once you see the separation clearly:
- configuration becomes reviewable
- failures become explainable
- automation becomes safer
It also becomes obvious why:
- monitoring alone isn't enough
- optimization without context is risky
- "just tune the flags" doesn't scale
A healthier mental model
A robust deployment pipeline looks like this:
- Execution runs the model
- Serving exposes the model
- Orchestration manages the model
- Intent defines how they should align
When intent is explicit, tools can:
- validate assumptions early
- detect drift in production
- guide corrective action
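One way to picture explicit intent is as a small declarative record that gets checked against the settings of all three layers before anything ships. The sketch below is purely illustrative: `Intent`, `DeployedConfig`, their fields, and the checks are hypothetical and do not correspond to any existing tool or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    max_context_tokens: int            # what the model is expected to handle
    max_new_tokens: int                # longest generation clients may request
    time_per_output_token_s: float     # what "good performance" means here


@dataclass(frozen=True)
class DeployedConfig:
    engine_max_model_len: int          # execution layer setting
    serving_timeout_s: float           # serving layer setting
    autoscale_metric: str              # orchestration layer setting


def validate(intent: Intent, cfg: DeployedConfig) -> list[str]:
    """Return one message per layer whose settings contradict the stated intent."""
    problems = []
    if cfg.engine_max_model_len < intent.max_context_tokens:
        problems.append("execution: engine context limit is below the intended context length")
    if cfg.serving_timeout_s < intent.max_new_tokens * intent.time_per_output_token_s:
        problems.append("serving: timeout is shorter than the longest intended generation")
    if cfg.autoscale_metric == "cpu":
        problems.append("orchestration: scaling signal does not reflect token load")
    return problems


intent = Intent(max_context_tokens=8192, max_new_tokens=1024, time_per_output_token_s=0.05)
cfg = DeployedConfig(engine_max_model_len=4096, serving_timeout_s=30.0, autoscale_metric="cpu")

for problem in validate(intent, cfg):
    print(problem)
```

Each setting in `cfg` is a plausible default when viewed alone. The mismatches only appear once there is a single statement of intent to compare them against.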
Closing thought
Most teams don't struggle with AI because models are hard. They struggle because three different systems are asked to behave like one, without a shared understanding of intent.
Once you separate:
- execution
- serving
- orchestration
the complexity becomes manageable — and the path to reliable production becomes much clearer.