ParallelIQ

Your Online Inference Has an On-Call Engineer. Your Batch Jobs Run at 2am Alone.

By Sam Hosseini · May 30, 2026 · 6 min read
Every AI team knows what their chatbot is doing right now. Response latency, token throughput, error rate — it's all on a dashboard somewhere, with an engineer on call if something breaks.

Nobody knows what their batch jobs cost.

That's the gap. And it's where a surprising amount of GPU budget quietly disappears.

---

Two Ways to Run an LLM

When most people think about LLM inference, they picture the real-time version: a user sends a message, the model responds in seconds, and if it's slow, someone notices immediately. That's online inference. It gets dashboards, SLAs, and pagers.

Batch inference is the other mode. No user is waiting. Instead of responding to one request at a time, the model processes thousands or millions of inputs together, on a schedule, usually overnight or otherwise outside business hours.

The use cases are everywhere:

  • A legal firm runs 50,000 contracts through an LLM to extract key clauses before a merger
  • A financial institution processes yesterday's news through a model to flag market risks before trading opens
  • A recruiter scores 10,000 resumes against a job description
  • A retailer generates product descriptions for 500,000 SKUs
  • A compliance team classifies a million customer emails for regulatory review

Same LLM. Same GPUs. Completely different operational reality.

---

Why Batch Gets Ignored

Online inference gets attention because it has consequences. If latency spikes, users complain. If the model goes down, revenue stops. There's immediate feedback, and teams build systems to respond to it.

Batch inference has no such feedback loop.

A batch job that runs at 2am with poor GPU utilization doesn't page anyone. A job that requests 8 GPUs but only saturates 3 doesn't trigger an alert. A model that processes 1 record at a time when it could handle 64 simultaneously doesn't show up on any dashboard.

It just costs money. Quietly. Every night.

---

The Specific Ways Batch Wastes Money

Wrong batch size: This is the most common and least visible form of waste. LLMs can process multiple inputs simultaneously; that is the whole point of batching. But if the batch size is misconfigured, the model may process one record at a time on hardware that could handle 64 at once. The hourly GPU cost is the same, but throughput is a fraction of what it should be, so the job runs longer and the bill grows with it.
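
To make the gap concrete, here is a minimal sketch of client-side batching. The `fake_model` function is a hypothetical stand-in for whatever inference call a pipeline actually makes, and 64 is simply the figure used above, not a recommendation; the right batch size depends on the model, sequence lengths, and GPU memory.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def run_job(
    records: Iterable[str],
    batch_size: int,
    run_model: Callable[[List[str]], List[str]],
) -> Tuple[List[str], int]:
    """Run every record through run_model; return (outputs, model call count)."""
    outputs: List[str] = []
    calls = 0
    for chunk in batched(records, batch_size):
        outputs.extend(run_model(chunk))
        calls += 1
    return outputs, calls


def fake_model(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a real inference call.
    return [text.upper() for text in batch]


records = [f"contract {i}" for i in range(50_000)]
_, calls_unbatched = run_job(records, 1, fake_model)
_, calls_batched = run_job(records, 64, fake_model)
print(calls_unbatched, calls_batched)  # 50000 782
```

Fewer, larger calls only help up to the point where the GPU is saturated or runs out of memory, so the call count is a rough proxy for the opportunity rather than a speedup figure.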

Over-provisioned jobs: A batch job requests 8 GPUs because that's what the last engineer configured. It only ever uses 3. The other 5 sit idle for the entire job duration — metered, billed, wasted. Without job-level visibility, nobody knows.

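Wrong GPU tier: A nightly classification job rarely needs an H100. A smaller GPU such as an A10G can often complete the same job at a fraction of the cost (around 40% in a case like this), provided the model fits in its memory and the job still finishes inside its window. But without model-aware placement, teams use whatever hardware is available, not whatever hardware is right.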
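
Whether a smaller GPU is actually cheaper for a given job depends on two ratios: hourly price and throughput. The sketch below shows the comparison. Every price and throughput figure in it is a made-up placeholder chosen for easy arithmetic, not a benchmark or a quoted rate, and the function assumes throughput scales linearly with GPU count, which is a simplification.

```python
def job_cost(
    records: int,
    records_per_hour: float,
    price_per_gpu_hour: float,
    gpus: int = 1,
) -> float:
    """Cost of a job = GPU-hours consumed x hourly price.

    records_per_hour is per-GPU throughput; linear scaling across GPUs
    is assumed here for simplicity.
    """
    hours = records / (records_per_hour * gpus)
    return hours * gpus * price_per_gpu_hour


# Hypothetical numbers for illustration only.
big = job_cost(records=1_000_000, records_per_hour=400_000, price_per_gpu_hour=4.00)
small = job_cost(records=1_000_000, records_per_hour=200_000, price_per_gpu_hour=0.80)

print(f"larger tier:  ${big:.2f}")
print(f"smaller tier: ${small:.2f}")
print(f"smaller tier costs {small / big:.0%} of the larger tier")
```

With these placeholder inputs the smaller tier takes twice as long but costs 40% as much. That trade is only acceptable when the longer runtime still fits the batch window.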

Inter-job idle time: Between scheduled jobs, GPUs sit empty. In Slurm environments this is especially common — scheduling gaps are built into workflows and nobody thinks to reclaim the capacity.
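
One way to size this gap is to sum the time in a scheduling window that no job covers. The sketch below assumes you already have (start, end) timestamps for the jobs on a given GPU or node, for example from scheduler accounting records; the input format and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def idle_time(jobs: List[Interval], window: Interval) -> timedelta:
    """Total time inside `window` that is not covered by any job interval."""
    start, end = window
    idle = timedelta(0)
    cursor = start  # everything before cursor is already accounted for
    for job_start, job_end in sorted(jobs):
        job_start = max(job_start, start)  # clip the job to the window
        job_end = min(job_end, end)
        if job_end <= job_start:
            continue  # job lies entirely outside the window
        if job_start > cursor:
            idle += job_start - cursor
        cursor = max(cursor, job_end)  # handles overlapping jobs
    if cursor < end:
        idle += end - cursor
    return idle


day = datetime(2026, 5, 30)
window = (day, day + timedelta(days=1))
jobs = [
    (day + timedelta(hours=1), day + timedelta(hours=3)),
    (day + timedelta(hours=2, minutes=30), day + timedelta(hours=4)),
    (day + timedelta(hours=22), day + timedelta(hours=23, minutes=30)),
]
idle = idle_time(jobs, window)
print(idle.total_seconds() / 3600, "idle hours out of 24")
```

Multiplying the idle hours by the number of GPUs held and the hourly rate turns the gap into a dollar figure.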

---

The Metric Nobody Is Tracking

Online inference teams track cost per request. It's a natural metric — every request maps to a user action and a latency commitment.

Batch inference teams track almost nothing at the job level. They see a monthly GPU bill. They don't see:

  • What each job actually cost to run
  • How long it took versus how long it should have taken
  • Whether the GPU tier matched the model's actual requirements
  • How much of the allocated capacity was used versus idle

Cost per job completed is the metric that matters for batch. Most teams have no way to produce it.
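
As a rough sketch of what job-level attribution could look like, the record below derives cost and idle cost from four inputs. The field names, the flat hourly rate, and the sample values are all assumptions for illustration; this is not the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    name: str
    gpus_allocated: int
    wall_hours: float
    avg_gpu_utilization: float  # 0.0 to 1.0, averaged across allocated GPUs
    price_per_gpu_hour: float   # assumed flat rate

    @property
    def gpu_hours(self) -> float:
        return self.gpus_allocated * self.wall_hours

    @property
    def cost(self) -> float:
        """What the job was billed: allocated GPU-hours x hourly rate."""
        return self.gpu_hours * self.price_per_gpu_hour

    @property
    def idle_cost(self) -> float:
        """Share of the bill spent on allocated but unused capacity."""
        return self.cost * (1.0 - self.avg_gpu_utilization)


# Hypothetical job: 8 GPUs requested, only 3 kept busy, 4 hours of wall time.
job = JobRecord(
    name="nightly-classification",
    gpus_allocated=8,
    wall_hours=4.0,
    avg_gpu_utilization=3 / 8,
    price_per_gpu_hour=2.00,
)
print(f"{job.name}: ${job.cost:.2f} total, ${job.idle_cost:.2f} idle")
```

Average utilization is a coarse proxy, since a GPU can report as busy while running far below its peak throughput, but even this level of detail is enough to rank jobs by recoverable spend.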

---

What This Means at Scale

The math compounds fast. A team running nightly batch jobs on a 100-GPU cluster with 30% waste is burning roughly $300K–$500K per year in recoverable GPU spend, depending on what each GPU-hour costs, and that is a conservative estimate. That's not a rounding error. That's an engineer's salary. That's a runway extension.
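
The arithmetic behind that range can be reconstructed under one assumption that the figure leaves implicit: the cluster is billed around the clock at a flat per-GPU-hour rate. The sketch below works backward from the dollar range to the hourly rates it implies; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_waste(gpus: int, waste_fraction: float, price_per_gpu_hour: float) -> float:
    """Dollars per year spent on wasted GPU-hours, assuming 24/7 billing."""
    return gpus * HOURS_PER_YEAR * waste_fraction * price_per_gpu_hour


def implied_rate(annual_dollars: float, gpus: int, waste_fraction: float) -> float:
    """Per-GPU-hour price that would produce the given annual waste."""
    return annual_dollars / (gpus * HOURS_PER_YEAR * waste_fraction)


wasted_gpu_hours = 100 * HOURS_PER_YEAR * 0.30
low = implied_rate(300_000, gpus=100, waste_fraction=0.30)
high = implied_rate(500_000, gpus=100, waste_fraction=0.30)

print(f"wasted GPU-hours per year: {wasted_gpu_hours:,.0f}")
print(f"implied rate: ${low:.2f} to ${high:.2f} per GPU-hour")
```

Under that assumption, 30% waste on 100 GPUs is about 262,800 GPU-hours a year, and the $300K–$500K range corresponds to roughly $1.14 to $1.90 per GPU-hour. Plugging in your own rate with `annual_waste` gives the figure for your fleet.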

And unlike online inference, where waste is visible and urgency is high, batch waste accumulates in the dark. By the time it shows up as a line item worth investigating, months of spend have already gone.

---

Where ParallelIQ Fits

The same model-aware intelligence that catches waste in online inference applies directly to batch workloads:

  • Tier misplacement — is this batch job running on the right GPU for what the model actually needs?
  • Utilization gaps — how much of the allocated capacity is actually being used per job?
  • Batch size detection — is the model processing inputs at its optimal throughput?
  • Job-level cost attribution — what did this job actually cost, in dollars, to complete?

The difference is that batch waste doesn't come with an alarm. It requires active detection — a scanner that looks at what's running, understands what the model requires, and surfaces the gap before it compounds another month.

That's exactly what piqc is built to do.

---

The Bottom Line

Your online inference has engineers watching it. Your batch jobs run at 2am with no one in the room.

That asymmetry is costing you more than you think. The first step is making the invisible visible — knowing what each job costs, whether the hardware matches the workload, and where the throughput is leaking.

Start with [piqc](https://github.com/paralleliq/piqc) — the open-source GPU waste scanner — or [reach out](mailto:info@paralleliq.ai) to discuss the full optimization layer for your fleet.
