The Missing Control Plane for GPU Platforms: Policy as Code, Not Just Schedulers

GPU platforms have become the backbone of modern AI. Startups and enterprises now deploy large language models, embedding pipelines, and inference services on clusters of high-end GPUs. Yet despite all this sophistication, most GPU platforms today are still operated using a fragile mix of YAML, scripts, tickets, and human judgment.

What’s missing is not a better scheduler. What’s missing is a real control plane.

GPUs Are Being Sold as Products, But Operated Like Infrastructure

When customers buy GPU capacity, they are not buying machines. They are buying product guarantees such as:

  • entitlements (“I paid for 4 H100s”)

  • SLAs (“low latency, no preemption”)

  • isolation (“no noisy neighbors”)

  • compliance (“EU-only, dedicated hardware”)

  • lifecycle semantics (“scale up, scale down, offboard cleanly”)

But today, those guarantees live in contracts, internal runbooks, tribal knowledge, and best-effort operational discipline. They are not encoded into the system. That gap is why GPU platforms struggle with:

  • noisy neighbors

  • broken SLAs

  • unpredictable onboarding

  • chaotic scaling

  • compliance gaps

  • constant manual firefighting

A Control Plane Is Not an Orchestrator

Most GPU platforms assume their “control plane” is Kubernetes, Slurm, or a scheduler. It isn’t. Those systems decide how workloads run. A real control plane decides:

  • who is allowed to run what

  • where workloads may be placed

  • how much capacity a tenant is entitled to

  • whether a request must be approved

  • what isolation rules apply

  • what happens on failure

  • how data must be retained or destroyed

Those are product and governance decisions, not scheduling decisions.
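
To illustrate the difference, here is a minimal sketch of one such governance decision: whether a scale-up request is within a tenant's entitlement, needs approval, or must be refused. The names `Entitlement`, `burst_limit`, and `decide_scale_request` are hypothetical and chosen only for this example; the point is that the function says nothing about where or how the workload runs.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Entitlement:
    tenant: str
    gpus_purchased: int  # hard entitlement, e.g. "I paid for 4 H100s"
    burst_limit: int     # hypothetical: extra GPUs that may be granted after approval


def decide_scale_request(ent: Entitlement, gpus_in_use: int, gpus_requested: int) -> Verdict:
    """Governance decision only; placement and scheduling happen elsewhere."""
    if gpus_requested <= 0:
        return Verdict.DENY
    total = gpus_in_use + gpus_requested
    if total <= ent.gpus_purchased:
        return Verdict.ALLOW
    if total <= ent.gpus_purchased + ent.burst_limit:
        return Verdict.NEEDS_APPROVAL
    return Verdict.DENY
```

A scheduler would only ever see requests that came back as `ALLOW`, or that were approved after a `NEEDS_APPROVAL` verdict.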

The 4-Lane Blueprint for a GPUaaS Control Plane

To make this concrete, I built a BPM-style swimlane blueprint that separates responsibilities into four distinct layers:

Lane 1 — Northbound Product API / UI

Tenant and admin actions such as:

  • create tenant

  • deploy model

  • request scale-up

  • change batch size

  • request GPU sharing

  • update quotas

  • offboard tenant

These are product-level intents.
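
One way to make "intent" concrete is to model each northbound action as a typed, validated request before any policy is evaluated. The sketch below is an assumption about how that could look, not part of the blueprint itself: `IntentKind`, `Intent`, `parse_intent`, and the payload field names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class IntentKind(Enum):
    CREATE_TENANT = "create_tenant"
    DEPLOY_MODEL = "deploy_model"
    REQUEST_SCALE_UP = "request_scale_up"
    CHANGE_BATCH_SIZE = "change_batch_size"
    REQUEST_GPU_SHARING = "request_gpu_sharing"
    UPDATE_QUOTAS = "update_quotas"
    OFFBOARD_TENANT = "offboard_tenant"


@dataclass(frozen=True)
class Intent:
    kind: IntentKind
    tenant: str
    requested_by: str
    params: dict[str, Any] = field(default_factory=dict)


def parse_intent(payload: dict[str, Any]) -> Intent:
    """Validate a raw API payload; raises ValueError on unknown kinds or missing fields."""
    try:
        kind = IntentKind(payload["kind"])  # ValueError if the kind is unknown
        return Intent(
            kind=kind,
            tenant=payload["tenant"],
            requested_by=payload["requested_by"],
            params=dict(payload.get("params", {})),
        )
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc
```

Lane 1 only captures what the tenant or admin wants. Whether the intent is allowed is decided in Lane 2.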

Lane 2 — Product Control Plane

Policy-as-Code • Workflow-as-Code • Lifecycle State Machines

This is the missing layer. It owns:

  • tenant entitlements and quotas

  • SLA tiers and isolation rules

  • admission control

  • placement policies

  • privacy and compliance rules

  • degradation and preemption policies

  • onboarding and offboarding workflows

Crucially, these are not scripts. They are:

  • declarative policies

  • versioned rules

  • durable workflows

  • explicit state machines

This is where Policy as Code actually lives.

Lane 3 — Orchestration & Substrate Control

This layer turns control-plane decisions into reality:

  • Kubernetes

  • Slurm / Kueue

  • Gateway API / inference gateways

  • GPU operators

  • Helm / Argo CD

  • Terraform / Ansible

  • CNI / CSI

It does not decide policy. It only realizes policy.
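
As a rough sketch of what "realizing" a decision could mean on Kubernetes, the function below translates a placement decision made upstream into a pod spec fragment. `nodeSelector`, `tolerations`, and `priorityClassName` are standard PodSpec fields and `topology.kubernetes.io/region` is a well-known node label; the `example.com/tenant` key, the two PriorityClass names, and the `PlacementDecision` type are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacementDecision:
    """Output of the control plane (Lane 2); Lane 3 only translates it."""
    tenant: str
    region: str
    dedicated: bool
    preemptible: bool


def to_pod_spec_fragment(decision: PlacementDecision) -> dict:
    node_selector = {"topology.kubernetes.io/region": decision.region}
    tolerations = []
    if decision.dedicated:
        # Hypothetical key: assumes dedicated nodes are labeled and tainted per tenant.
        node_selector["example.com/tenant"] = decision.tenant
        tolerations.append({
            "key": "example.com/tenant",
            "operator": "Equal",
            "value": decision.tenant,
            "effect": "NoSchedule",
        })
    return {
        "nodeSelector": node_selector,
        "tolerations": tolerations,
        # Hypothetical PriorityClass names; they would have to exist in the cluster.
        "priorityClassName": "gpu-preemptible" if decision.preemptible else "gpu-guaranteed",
    }
```

Note that the function contains no policy of its own: it never asks whether the tenant is allowed dedicated nodes, only how to express a decision that has already been made.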

Lane 4 — Execution & Hardware

Where work actually happens:

  • vLLM, Triton, TGI

  • PyTorch

  • ROCm / CUDA

  • GPUs

  • NICs

  • NVMe

This is the physical reality layer.

Policy as Code: The Core of the Control Plane

The key idea is that GPU platforms need a policy-as-code control plane, not just a scheduler. Instead of writing documents like:

“Enterprise tenants must run on dedicated GPUs in EU clusters and cannot be preempted.”

You encode:

# Example: tenant policy-as-code configuration
tenant:
  tier: enterprise
  isolation: dedicated
  data_residency: eu-only
  allow_preemption: false
  allow_gpu_sharing: false

And the control plane enforces it by:

  • rejecting invalid requests

  • constraining placement

  • selecting eligible clusters

  • configuring schedulers

  • applying isolation rules

  • blocking unsafe actions

This is broadly how hyperscalers operate internally. Most GPU platforms today simply don’t have this layer.
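
A minimal sketch of two of these enforcement steps, rejecting invalid requests and selecting eligible clusters, might look like the following. The `TenantPolicy` fields mirror the configuration above; `Cluster`, `DeployRequest`, the region values, and the function names are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    tier: str
    isolation: str          # "dedicated" or "shared"
    data_residency: str     # "eu-only" or "any"
    allow_preemption: bool
    allow_gpu_sharing: bool


@dataclass(frozen=True)
class Cluster:
    name: str
    region: str             # hypothetical coarse region: "eu", "us", ...
    dedicated_nodes: bool


@dataclass(frozen=True)
class DeployRequest:
    wants_gpu_sharing: bool = False
    preemptible: bool = False


def violations(policy: TenantPolicy, req: DeployRequest) -> list[str]:
    """Return the reasons a request must be rejected; an empty list means admit."""
    reasons = []
    if req.wants_gpu_sharing and not policy.allow_gpu_sharing:
        reasons.append("GPU sharing is not allowed for this tenant")
    if req.preemptible and not policy.allow_preemption:
        reasons.append("preemptible placement is not allowed for this tenant")
    return reasons


def eligible_clusters(policy: TenantPolicy, clusters: list[Cluster]) -> list[Cluster]:
    """Filter clusters down to those that satisfy residency and isolation rules."""
    result = []
    for cluster in clusters:
        if policy.data_residency == "eu-only" and cluster.region != "eu":
            continue
        if policy.isolation == "dedicated" and not cluster.dedicated_nodes:
            continue
        result.append(cluster)
    return result
```

With the enterprise policy above and three hypothetical clusters (an EU cluster with dedicated nodes, a US cluster with dedicated nodes, and an EU cluster without them), only the first would remain eligible, and a request asking for GPU sharing would be rejected before it ever reached a scheduler.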

Workflows, Not Scripts

A real control plane is workflow-driven. Examples:

  • tenant onboarding

  • quota increases

  • SLA upgrades

  • capacity exhaustion

  • incident response

  • offboarding and data destruction

These are not bash scripts. They are long-running processes with:

  • approvals

  • retries

  • compensations

  • audit trails

  • explicit state transitions

That’s why Lane 2 must be implemented as Policy-as-Code + Workflow-as-Code + Durable State Machines.
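
To show how this differs from a script, here is a small in-memory sketch of a tenant onboarding workflow with explicit states, an audit trail, retries, and a compensation step. It is illustrative only: the state names and the `OnboardingWorkflow` class are hypothetical, and nothing here is durable. A production version would persist state and history, for example in a database or a durable workflow engine.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    AWAITING_APPROVAL = "awaiting_approval"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    FAILED = "failed"


# Explicit transition table: anything not listed here is illegal.
TRANSITIONS = {
    State.REQUESTED: {State.AWAITING_APPROVAL},
    State.AWAITING_APPROVAL: {State.PROVISIONING, State.FAILED},
    State.PROVISIONING: {State.ACTIVE, State.FAILED},
    State.ACTIVE: set(),
    State.FAILED: set(),
}


@dataclass
class OnboardingWorkflow:
    tenant: str
    state: State = State.REQUESTED
    audit: list = field(default_factory=list)  # (from_state, to_state, reason)

    def transition(self, new_state: State, reason: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.audit.append((self.state.value, new_state.value, reason))
        self.state = new_state

    def provision(self, step, undo, max_attempts: int = 3) -> None:
        """Run `step` with retries; if all attempts fail, run `undo` as compensation."""
        for attempt in range(1, max_attempts + 1):
            try:
                step()
            except RuntimeError as exc:
                self.audit.append((self.state.value, self.state.value, f"attempt {attempt} failed: {exc}"))
                continue
            self.transition(State.ACTIVE, f"provisioned on attempt {attempt}")
            return
        undo()
        self.transition(State.FAILED, "retries exhausted, compensation applied")
```

Every change of state goes through `transition`, so an illegal jump (for example from `requested` straight to `active`) fails loudly and every legal one leaves an audit record.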

Why This Is Important

Without this layer, GPU platforms are forced to operate like fragile infrastructure projects instead of reliable products. That’s why we see:

  • unpredictable performance

  • noisy neighbors

  • broken SLAs

  • chaotic onboarding

  • manual firefighting

  • compliance gaps

A programmable control plane fixes this by turning:

  • product promises

  • governance rules

  • lifecycle semantics

into machine-enforceable reality.

Closing Thought

Kubernetes orchestrates containers. Slurm schedules jobs. vLLM runs inference. llm-d routes and scales inference traffic. But none of those systems knows what a tenant, an SLA, or a privacy tier is. That knowledge belongs in a real control plane, and that control plane must be built on policy as code, not scripts.

Closing

I’m open-sourcing parts of this control-plane blueprint and documenting the workflows, policies, and lifecycle semantics behind it. You can find the blueprint here.

If you’re building a GPU platform or GPUaaS product and wrestling with multi-tenancy, SLAs, or onboarding reliability, I’d love to compare notes.

