Insight / signal

Cheap AI only works if you can prove the work

The most dangerous phrase in AI ops this week is not “frontier model”.

It is “free forever”.

I have been running agents overnight lately. Small ones. Boring ones. Research triage, source scans, a bit of tidy-up. And the thing I have learned is that I care almost nothing about how confident they sound in the morning. I care about the evidence they leave behind. That gap, between confidence and evidence, is the whole story of where cheap AI is about to go wrong.

The last couple of days have been full of the usual local-model chatter. Run Hermes with local models. Use free OpenRouter lanes. Push Gemma through Ollama. Try lighter coding models. Build a command centre. Stop paying premium rates for every boring background task.

Some of that is genuinely useful. I want cheaper worker lanes. Anyone building real AI systems should want them.

But the bad version of this conversation is already obvious. People will hear “cheap” and quietly translate it to “safe to automate”. That is where the wheels come off.

The useful lesson is not that every business should run its marketing, sales ops, research and support through whatever model is cheapest this morning. The useful lesson is that AI work needs routing.

Use the expensive model where judgement matters. Use the cheap model where the job is bounded and testable. Use the local model where privacy, speed or cost make sense. Use a human when the decision carries commercial, legal, reputational or client risk. And above all, know how you are going to check the result.

That last line is the bit most businesses skip, because it is not sexy. Nobody wants to sell “verification evidence” on a webinar. It sounds like admin. It sounds like work.

Good. It is work.

The Hermes v0.18 release is useful here, and not because of the benchmark theatre. The serious part is the move towards agents proving they have finished. Completion contracts. Proof-of-work. Running checks. Learning repeatable procedures. Background subagents that fan out and report back. The language around it is a bit excitable, because this is still AI land and apparently we must all pretend every release changed civilisation before lunch. Underneath the noise, the direction is right.

“Done” cannot mean “the agent said it is done”. That is not enough once AI is touching real work.
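
To make that concrete, here is a minimal sketch of what a completion check could look like. It is not how Hermes implements completion contracts. The names (CompletionReport, verify, EXAMPLE_CHECKS) and the two example checks are my own illustrations. The only point is that the agent's claim and the evidence are stored separately, and "done" needs both.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CompletionReport:
    """Hypothetical record: what the agent claimed vs. what the checks found."""
    claimed_done: bool
    evidence: Dict[str, bool] = field(default_factory=dict)  # check name -> passed

    @property
    def done(self) -> bool:
        # Done only if the agent claims it, at least one check ran,
        # and every check passed. A claim with no evidence does not count.
        return self.claimed_done and bool(self.evidence) and all(self.evidence.values())


def verify(claimed_done: bool, output: str,
           checks: Dict[str, Callable[[str], bool]]) -> CompletionReport:
    """Run every named check against the output and keep the results as evidence."""
    return CompletionReport(claimed_done, {name: check(output) for name, check in checks.items()})


# Illustrative checks only; real ones depend on the job.
EXAMPLE_CHECKS = {
    "not_empty": lambda text: bool(text.strip()),
    "cites_a_source": lambda text: "http" in text,
}

report = verify(True, "Great summary, trust me.", EXAMPLE_CHECKS)
print(report.done, report.evidence)
```

In this sketch the cheerful summary fails, because it cites nothing. The agent still said it was done. The report just does not agree.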

OpenAI’s enterprise material points the same way from the other end of the market. The HP Frontier partnership is not framed as “use the cleverest model everywhere”. It is about context, permissions, deployment controls, evaluation, and moving from pilots into an operating model. Their Codex research says agents are taking on longer tasks, sometimes work that would have taken a person more than an hour, occasionally much longer than that. Big shift. It also makes the control problem more serious, not less.

If an agent does fifteen minutes of work, you can eyeball most of it. If an agent does eight hours of work in the background while you sleep, you need more than vibes and a cheerful summary.

Anthropic’s small business and finance announcements say the same thing in different packaging. The useful examples are not clever chat prompts. They are workflows inside tools people already use: QuickBooks, HubSpot, Canva, Excel, CRMs, market data feeds, document systems. Existing permissions hold. Approvals happen before anything sends or pays. Connectors, audit logs, review.

Boring words again. The actual moat again.

Here is where business owners need to be careful. Cheap AI makes it easy to run more attempts. More drafts, more checks, more research scans, more lead triage, more content variants, more daily summaries, more little agents doing little jobs. That sounds great right up until the business is full of invisible work nobody has looked at.

A cheap model can summarise a sales call badly. It can misclassify a lead. It can update a CRM field with quiet confidence and no idea what the field means. It can draft three hundred words of copy that are technically fine and commercially dead. It can decide a source is relevant because the headline looked convincing.

The problem is not that cheaper models are useless. They are not. The problem is that cheaper models make bad operating habits cheaper too.

So route by risk. If a task is low risk and easy to check, send it down the cheap lane. If it needs taste, synthesis, strategy or judgement, do not be a hero. Use the better model, a model panel, or a human review step. If it can touch money, customer trust, legal exposure, public claims, live systems or client delivery, put an approval gate in the path.
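
The routing rule above is simple enough to write down. This is a rough sketch, not a product. The Task fields and lane names are assumptions I have made for illustration, and sending hard-to-check work to the stronger lane is my reading of the rule rather than something it states outright.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    CHEAP = "cheap"
    STRONG = "strong"                  # better model, model panel, or human review step
    APPROVAL_GATE = "approval_gate"    # a human approves before anything ships


@dataclass
class Task:
    name: str
    easy_to_check: bool = False
    needs_judgement: bool = False      # taste, synthesis, strategy
    # Money, customer trust, legal exposure, public claims,
    # live systems or client delivery.
    touches_sensitive: bool = False


def route(task: Task) -> Lane:
    """Route by risk first, then by how checkable the work is."""
    if task.touches_sensitive:
        return Lane.APPROVAL_GATE
    if task.needs_judgement or not task.easy_to_check:
        # Assumption: if nobody can check it cheaply, it is not cheap-lane work.
        return Lane.STRONG
    return Lane.CHEAP


print(route(Task("tag research sources", easy_to_check=True)))
print(route(Task("draft positioning", needs_judgement=True)))
print(route(Task("send invoice", easy_to_check=True, touches_sensitive=True)))
```

The order of the checks matters. Risk is tested first, so a task that is easy to check but can move money still hits the gate.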

This is not anti-automation. It is how automation survives contact with the real world.

The practical stack for most businesses is going to look less like one super-intelligent assistant and more like a small workshop. A cheap worker does the rough extraction. A local worker handles private or repetitive preprocessing. A stronger model does synthesis and judgement. A deterministic workflow handles the steps that should never be improvised. A human approves the risky moves. Logs record what happened. Checks decide whether the work counts.

That is the operating layer. And this is exactly where the agency model changes.

If you are still selling “we make more content because AI is faster”, you are walking straight into commodity pricing. The client will ask why it still costs money if the machine wrote the first draft. Fair question. The answer cannot be “because we have better prompts”. Nobody cares.

The answer is that the value was never the raw output. The value is the system around it. What sources did it use. What was it allowed to touch. Which model handled which part. What got checked. What failed. What needed a human. What did we learn for next time.

At Cleo, that is the line I keep coming back to. The cheapest model that passes the checks is the right model. But if you do not have the checks, you are not doing AI cost control. You are just buying cheaper uncertainty.
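
In code, "the cheapest model that passes the checks" is an escalation ladder. The sketch below is illustrative only. The workers are stand-in functions rather than real model calls, and the check is a toy. What matters is the shape: try the cheapest lane first, run the check, log the result, move up only on failure, and hand over to a human if nothing passes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Worker = Tuple[str, Callable[[str], str]]  # (lane name, function that does the job)


def cheapest_passing(
    job: str,
    workers: List[Worker],
    check: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str], List[Dict[str, object]]]:
    """Try workers in order (cheapest first) and return the first output that passes."""
    log: List[Dict[str, object]] = []
    for name, run in workers:
        output = run(job)
        passed = check(output)
        log.append({"model": name, "passed": passed})
        if passed:
            return name, output, log
    # Nothing passed: the caller should escalate to a human.
    return None, None, log


# Stand-in workers; in practice these would call real models.
def cheap_worker(job: str) -> str:
    return "lead: unknown"


def strong_worker(job: str) -> str:
    return "lead: qualified, budget confirmed"


def has_classification(output: str) -> bool:
    return output.startswith("lead: ") and "unknown" not in output


WORKERS = [("cheap", cheap_worker), ("strong", strong_worker)]
winner, output, log = cheapest_passing("Classify this lead", WORKERS, has_classification)
print(winner, log)
```

Without the check function there is no ladder. You would simply take the first answer, which is exactly the cheaper uncertainty problem.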

So before you route anything to a cheaper or local model, get honest about ten questions. What exactly is the job. Is it low, medium or high risk. What sources is the model allowed to use. What output actually counts as acceptable. How will the system check the result. What happens when the check fails. When does it escalate to a stronger model. When does it escalate to a human. What gets logged. Who owns the workflow once it is live.
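
If it helps, the ten questions can live as a small spec that has to be filled in before anything gets automated. This is a sketch under my own assumptions. The field names and the list of answers that count as vague are invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class WorkflowSpec:
    """One field per question. An empty string means unanswered."""
    job: str = ""
    risk_level: str = ""                       # "low", "medium" or "high"
    allowed_sources: str = ""
    acceptance_criteria: str = ""
    check_method: str = ""
    on_check_failure: str = ""
    escalate_to_stronger_model_when: str = ""
    escalate_to_human_when: str = ""
    what_gets_logged: str = ""
    owner: str = ""


VAGUE_ANSWERS = {"", "tbd", "todo", "n/a", "?"}   # assumption: treat these as no answer
RISK_LEVELS = {"low", "medium", "high"}


def unanswered(spec: WorkflowSpec) -> List[str]:
    """Return the names of questions that still have no usable answer."""
    missing = [
        f.name for f in fields(spec)
        if getattr(spec, f.name).strip().lower() in VAGUE_ANSWERS
    ]
    if "risk_level" not in missing and spec.risk_level.strip().lower() not in RISK_LEVELS:
        missing.append("risk_level")
    return missing


def ready_to_automate(spec: WorkflowSpec) -> bool:
    return not unanswered(spec)


draft = WorkflowSpec(job="Triage inbound leads", risk_level="medium", owner="TBD")
print(ready_to_automate(draft), unanswered(draft))
```

A half-filled spec like the draft above returns a list of the gaps. That list is the work to do before the automation.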

If those answers are vague, do not start with the automation. Start with the workflow.

That is the difference between using AI as a toy and building an AI operating system for real work.


Jason Sibley is the founder of Cleo, a post-agency marketing and AI company. JasonVsTheNoise is where he writes about what is actually happening with AI, marketing, and how businesses should be thinking about both.