AI/ML Model Operations

Orchestration, Serving, and Execution: The Three Layers of Model Deployment

As AI models move from experimentation to production, teams often discover that deployment is where complexity explodes. It’s not because models are mysterious. It’s because three fundamentally different systems are involved, and they are often treated as one.

Those systems are:

  1. Execution

  2. Serving

  3. Orchestration

Understanding what each layer does — and what it does not do — is essential to building reliable, cost-effective AI systems.

The core problem: one word, three meanings

When someone says “We deployed the model,” they might mean:

  • the model runs on a GPU

  • the model responds to HTTP requests

  • the model is scaled and monitored in Kubernetes

These are not the same thing. They correspond to three separate layers, each with different responsibilities, failure modes, and ownership.

1. Execution: how the model actually runs

Execution is the innermost layer. This is where:

  • model weights are loaded

  • GPU memory is allocated

  • kernels are launched

  • batching happens

  • tokens are generated

Execution systems are:

  • model-aware

  • GPU-aware

  • latency-critical

Examples

  • vLLM

  • TensorRT-LLM

  • PyTorch inference code

  • ONNX Runtime

If execution fails, no inference happens. This layer determines:

  • throughput

  • latency

  • memory pressure

  • GPU utilization

Execution is mandatory. Without it, there is no model.
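
To make this concrete, here is a minimal execution-only sketch using vLLM’s offline API. The model name, memory fraction, and sampling settings are placeholders rather than recommendations; there is no server and no orchestrator involved, just the engine on a local GPU.

```python
# Execution only: load weights, batch prompts, generate tokens.
# Model name and settings below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    gpu_memory_utilization=0.90,               # assumption: leave ~10% headroom
)

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Explain the difference between serving and orchestration.",
    "What does a KV cache store?",
]

# vLLM batches these prompts internally and schedules them on the GPU.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```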

2. Serving: how requests reach the model

Serving wraps execution with an interface. It answers questions like:

  • How do clients send requests?

  • How many requests are handled concurrently?

  • What happens when a request times out?

  • How do we check if the model is healthy?

Serving systems provide:

  • HTTP or gRPC APIs

  • request parsing

  • concurrency control

  • basic health checks

Examples

  • vLLM HTTP server

  • TGI

  • Triton Inference Server

  • Custom FastAPI / gRPC services

In modern LLM runtimes, serving and execution are often combined. vLLM, for example, is both an execution engine and a server. Without serving:

  • batch or offline inference still works

  • online inference does not

Serving is required for real-time production workloads, optional for offline jobs.
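
As an illustration, a minimal serving wrapper might look like the sketch below: an HTTP endpoint, a concurrency cap, a timeout, and a health check. The `run_inference` function is a hypothetical stand-in for whatever execution engine sits underneath; in practice a runtime like vLLM or TGI gives you this layer out of the box.

```python
# Serving sketch: an HTTP interface, a concurrency cap, and a health check.
# The execution engine behind run_inference is assumed; here it is a stub.
import asyncio
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
MAX_CONCURRENCY = 8                       # assumption: tune to your GPU and model
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

async def run_inference(prompt: str, max_tokens: int) -> str:
    # Hypothetical hook into the execution layer (vLLM, TensorRT-LLM, ...).
    await asyncio.sleep(0.05)             # stand-in for real GPU work
    return f"[generated {max_tokens} tokens for: {prompt[:30]}...]"

@app.get("/health")
async def health() -> dict:
    # Basic liveness signal; a real check might also probe the engine.
    return {"status": "ok"}

@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    async with semaphore:                 # concurrency control
        try:
            text = await asyncio.wait_for(
                run_inference(req.prompt, req.max_tokens), timeout=30.0
            )
        except asyncio.TimeoutError:
            raise HTTPException(status_code=504, detail="generation timed out")
    return {"text": text}
```

Started with uvicorn, this exposes /generate for clients and /health for probes while keeping the engine itself hidden behind the interface.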

3. Orchestration: how the system stays alive

Orchestration is the outermost layer. It manages:

  • where processes run

  • how many replicas exist

  • restarts on failure

  • scaling decisions

  • lifecycle events

Orchestration systems are:

  • model-agnostic

  • resource-centric

  • slow-moving (seconds to minutes)

Examples

  • Kubernetes

  • Nomad

  • Ray

  • Slurm (for batch workloads)

Orchestration does not understand:

  • tokens

  • batching

  • GPU memory layouts

  • model behavior

It understands:

  • pods

  • CPUs

  • memory

  • GPUs as allocatable resources

Without orchestration:

  • you can still run a model

  • but scaling, resilience, and operations are manual
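
To see the contrast, here is a sketch using the Kubernetes Python client. Everything orchestration cares about is visible: an image, a replica count, and resource quantities, with the GPU reduced to a single allocatable unit. The image tag, model name, and resource sizes are placeholders.

```python
# Orchestration-level sketch: Kubernetes sees replicas, an image, and resource
# counts -- not tokens, batches, or KV caches. All names and sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

container = client.V1Container(
    name="llm-server",
    image="vllm/vllm-openai:latest",           # assumption: a vLLM serving image
    args=["--model", "my-org/my-model"],       # hypothetical model name
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "32Gi"},
        limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1DeploymentSpec(
        replicas=2,                            # scaling decision, model-agnostic
        selector=client.V1LabelSelector(match_labels={"app": "llm-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-server"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```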

Are all three required?

Not always — but almost always in production. Let’s be precise.

Local experimentation

  • Execution: ✅

  • Serving: ❌

  • Orchestration: ❌

This is not deployment.

Offline or batch inference

  • Execution: ✅

  • Serving: ❌

  • Orchestration: ⚠️ optional

Example:

  • embedding generation jobs

  • nightly batch runs
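
For example, an embedding job can be a plain script with no serving layer at all. A sketch using sentence-transformers (the model name and file paths are placeholders) might look like this:

```python
# Offline batch job sketch: execution without any serving layer.
# Model name and paths are placeholders; any embedding model would do.
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

with open("documents.jsonl") as f:            # hypothetical input file
    docs = [json.loads(line)["text"] for line in f]

# Encode in batches; no HTTP endpoint, no health checks, no autoscaling.
embeddings = model.encode(docs, batch_size=64, show_progress_bar=True)

with open("embeddings.jsonl", "w") as out:
    for doc, vec in zip(docs, embeddings):
        out.write(json.dumps({"text": doc, "embedding": vec.tolist()}) + "\n")
```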

Single-node online service

  • Execution: ✅

  • Serving: ✅

  • Orchestration: ❌

Example:

  • one VM

  • one container

  • manual restarts

Works, but fragile.

Production online inference (the common case)

  • Execution: ✅

  • Serving: ✅

  • Orchestration: ✅

This is where:

  • GPUs are expensive

  • traffic fluctuates

  • failures must be handled automatically

This is also where most complexity appears.

Why things break in production

The pain doesn’t come from having three layers. It comes from pretending they are one.

Common failure modes:

  • Orchestration scales based on CPU, not tokens

  • Serving timeouts ignore model context length

  • Execution settings copied from blog posts

  • GPU underutilization hidden by healthy pods

Each layer is behaving “correctly” in isolation — but incorrectly as a system.
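
The second failure mode is easy to quantify. The back-of-the-envelope sketch below (all throughput numbers are assumptions, not benchmarks) shows how a gateway timeout chosen without the model’s context length in mind rejects requests the execution layer could have finished:

```python
# Back-of-the-envelope sketch: a serving timeout that ignores the execution
# layer's token budget will cut off long generations. Numbers are assumptions.

def required_timeout_s(prompt_tokens: int, max_new_tokens: int,
                       prefill_tok_per_s: float = 4000.0,
                       decode_tok_per_s: float = 40.0,
                       safety_margin: float = 1.5) -> float:
    prefill = prompt_tokens / prefill_tok_per_s
    decode = max_new_tokens / decode_tok_per_s
    return (prefill + decode) * safety_margin

# A "reasonable-looking" gateway timeout copied from a web service template:
gateway_timeout_s = 30.0

# A long-context request the model is perfectly capable of handling:
needed = required_timeout_s(prompt_tokens=8000, max_new_tokens=1024)
print(f"needed ~{needed:.0f}s, gateway allows {gateway_timeout_s:.0f}s")
# -> the serving layer times out requests that execution would have completed
```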

Why Kubernetes alone is not enough

Kubernetes is excellent at:

  • keeping processes alive

  • allocating resources

  • restarting failures

It does not understand:

  • batch collapse

  • KV cache pressure

  • token-level latency

  • model-specific constraints

That semantic gap is why LLM autoscaling is hard and why “CPU-based HPA” often fails.

The missing abstraction: intent

What’s missing in most deployments is a way to express:

  • what the model expects

  • what “good performance” means

  • what constraints must not be violated

Those concepts don’t belong exclusively to:

  • execution

  • serving

  • or orchestration

They sit above all three.

Without that layer, teams rely on:

  • tribal knowledge

  • fragile defaults

  • reactive tuning

Why this distinction matters

Once you see the separation clearly:

  • configuration becomes reviewable

  • failures become explainable

  • automation becomes safer

It also becomes obvious why:

  • monitoring alone isn’t enough

  • optimization without context is risky

  • “just tune the flags” doesn’t scale

A healthier mental model

A robust deployment pipeline looks like this:

  • Execution runs the model

  • Serving exposes the model

  • Orchestration manages the model

  • Intent defines how they should align

When intent is explicit, tools can:

  • validate assumptions early

  • detect drift in production

  • guide corrective action
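
As a sketch of what that could look like (the field names, thresholds, and observed-config source are invented for illustration, not a real tool’s schema):

```python
# Illustrative sketch only: a tiny "intent" spec and a validation pass.
# Fields, thresholds, and the observed-config dict are all invented here.
from dataclasses import dataclass

@dataclass
class DeploymentIntent:
    model_name: str
    max_context_tokens: int          # what the model expects
    p95_latency_budget_s: float      # what "good performance" means
    min_gpu_memory_gib: int          # a constraint that must not be violated

def validate(intent: DeploymentIntent, observed: dict) -> list[str]:
    """Compare declared intent against observed serving/orchestration config."""
    problems = []
    if observed["request_timeout_s"] < intent.p95_latency_budget_s:
        problems.append("serving timeout is tighter than the latency budget")
    if observed["max_model_len"] < intent.max_context_tokens:
        problems.append("engine context window is smaller than the model expects")
    if observed["gpu_memory_gib"] < intent.min_gpu_memory_gib:
        problems.append("allocated GPU memory is below the declared minimum")
    return problems

intent = DeploymentIntent("my-org/my-model", 8192, 20.0, 40)
observed = {"request_timeout_s": 15.0, "max_model_len": 4096, "gpu_memory_gib": 80}
for p in validate(intent, observed):
    print("drift:", p)
```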

Closing thought

Most teams don’t struggle with AI because models are hard. They struggle because three different systems are asked to behave like one, without a shared understanding of intent. Once you separate:

  • execution

  • serving

  • orchestration

the complexity becomes manageable — and the path to reliable production becomes much clearer.

Don’t let performance bottlenecks slow you down. Optimize your stack and accelerate your AI outcomes.
