The AI boom has entered the prove-it phase

The useful AI story this week is not another model.

It is the gap opening between companies with proof and companies with theatre.

In the last couple of days, OpenAI has been pushing a very enterprise-friendly story around Codex. It was named a leader in Gartner’s 2026 Magic Quadrant for enterprise AI coding agents. It published a Virgin Atlantic case study saying Codex helped the airline hit a fixed mobile-app deadline with near-complete unit test coverage and zero P1 defects at launch. It also published a Ramp case study saying engineers are getting substantive pull-request feedback in minutes instead of hours.

Fine. These are vendor case studies, so we should not swallow them whole.

But notice the shape of the claims.

Deadline hit. Test coverage. Defect severity. Review speed. Real workflow, real constraint, real operational measure.

That is a very different species of AI claim from “we use AI now” or “our team is 10x” or “look at this agent doing a magic trick in a browser”.

Then, in the same 48-hour window, TechCrunch reported on AI startups stretching ARR (annual recurring revenue) beyond recognition. Contracted ARR called ARR. Annualised run-rate revenue treated like dependable recurring revenue. Investors are apparently aware of the exaggeration, which persists because once one company in a category starts inflating the number, everyone else feels pressure to keep up.

That is the other side of the market.

Not proof. Theatre with a spreadsheet.

And honestly, this is where AI starts to look less like a technology story and more like a business maturity test.

The demo phase has been fun. Useful, even. Everyone needed to see what these systems could do. Generate a landing page. Draft a proposal. Summarise calls. Write code. Search a repo. Build an app from a prompt.

But demos are cheap now.

The question is not “can AI produce something impressive in a controlled clip?” The question is: what changed in the business after you put it into the workflow?

Did the team ship faster?

Did quality improve?

Did support tickets drop?

Did sales follow-up get better?

Did the content produce qualified conversations, or just applause from other AI people?

Did the agent fail safely?

Did anyone check the logs?

This is where most AI adoption will get uncomfortable, because evidence is much less glamorous than possibility.

It is easy to say an AI agent saves time. It is harder to show the baseline, track the run, compare the output, log the mistakes, and admit where the human still had to step in.

It is easy to say AI content improves marketing. It is harder to prove it created pipeline rather than more beige noise on LinkedIn.

It is easy to say a startup has monster AI revenue. It is harder when someone asks what is actually contracted, what is used, what is recurring, what the gross margin is, and what disappears if the customer stops experimenting next month.

That last bit matters for normal businesses too, not just venture-backed startups. The same temptation exists inside agencies, consultancies, software companies and internal teams. AI creates a lot of plausible-looking output. Plausible reports. Plausible dashboards. Plausible strategies. Plausible code. Plausible content calendars. Plausible automation demos.

Plausible is not the same as valuable.

This is why I think the next serious AI advantage is proof design.

Not prompt design. Not another model leaderboard. Proof design.

Before you automate a workflow, decide what evidence would convince a sceptical commercial director that it worked.

If it is a coding workflow, maybe that proof is review time, defect rate, test coverage, cycle time, rework, or deployment frequency. Not one magic number. A small set of measures that stop you lying to yourself.
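To make "a small set of measures" concrete, here is a minimal sketch of a before-and-after scorecard. Everything in it is an assumption for illustration: the `Measure` class, the measure names and the numbers are hypothetical, not taken from any of the case studies above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """One operational measure, captured before and after the AI workflow."""
    name: str
    baseline: float
    after: float
    lower_is_better: bool = True  # e.g. wait time, defects; False for throughput

    def improved(self) -> bool:
        # An unchanged value does not count as an improvement.
        if self.lower_is_better:
            return self.after < self.baseline
        return self.after > self.baseline

    def change_pct(self) -> float:
        # Relative change against the baseline (assumes a non-zero baseline).
        return (self.after - self.baseline) / self.baseline * 100.0


def scorecard(measures):
    """Report every measure, including the ones that got worse."""
    return {m.name: ("better" if m.improved() else "not better") for m in measures}


# Hypothetical numbers, purely for illustration.
measures = [
    Measure("review_wait_hours", baseline=6.0, after=1.5),
    Measure("defects_per_release", baseline=4.0, after=5.0),
    Measure("deploys_per_week", baseline=3.0, after=4.0, lower_is_better=False),
]
result = scorecard(measures)
```

The point of the design is that the scorecard cannot hide the bad row. In this made-up example, review got faster and deploys went up, but defects also went up, and that tradeoff is sitting right there in the output.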

If it is a marketing workflow, maybe it is qualified conversations, proposal requests, reply quality, cost per useful asset, conversion rate, sales-cycle movement, or which ICP actually engaged. Views can sit in the corner and behave themselves.

If it is an agent workflow, proof includes the boring stuff: what the agent accessed, what it changed, where it stopped, what it asked for approval on, what failed, what got retried, and whether a human could reconstruct the run afterwards.
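The boring stuff in that list maps quite naturally onto an append-only run log. What follows is a minimal sketch under stated assumptions: `RunLog`, `RunEvent`, the event kinds and the targets are all hypothetical names invented for this example, not the API of OpenClaw, Gemini or any other product.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Hypothetical event kinds, mirroring the questions in the paragraph above.
ALLOWED_KINDS = {
    "accessed", "changed", "approval_requested", "failed", "retried", "stopped",
}


@dataclass
class RunEvent:
    kind: str
    target: str
    detail: str = ""
    # UTC timestamp recorded when the event is created.
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


class RunLog:
    """Append-only record of one agent run, so a human can reconstruct it later."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self.events = []

    def record(self, kind: str, target: str, detail: str = "") -> None:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self.events.append(RunEvent(kind, target, detail))

    def summary(self) -> dict:
        # Count events per kind, e.g. how many failures and retries occurred.
        counts = {}
        for event in self.events:
            counts[event.kind] = counts.get(event.kind, 0) + 1
        return counts

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the events happened.
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(event)})
            for event in self.events
        )


# Hypothetical run, purely for illustration.
log = RunLog("run-001")
log.record("accessed", "crm/contacts")
log.record("failed", "email/draft", "timeout")
log.record("retried", "email/draft")
log.record("approval_requested", "email/send", "external send needs a human")
log.record("stopped", "email/send", "awaiting approval")
```

Nothing clever is happening here, which is rather the point. If a sceptical operator can read the JSON lines top to bottom and tell what the agent touched, where it failed and where it waited for a human, the run is inspectable.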

This is why OpenClaw adding a Doctor/security tool is more interesting than it sounds. It is not as flashy as a new voice mode or a big model announcement. But diagnostics, checks and safe operation are the kind of features that matter when agents move from toy jobs into real business systems.

Google’s I/O announcements point in the same direction from a different angle. Managed Agents in the Gemini API can provision remote environments, call tools, execute code, manage files, browse the web and process live data. That is powerful. It is also exactly the sort of thing that needs evidence, boundaries and audit trails.

The more AI can do, the less acceptable it is to manage it with vibes.

That should be obvious, but the market still rewards theatre. Big demo. Big claim. Big revenue multiple. Big post about replacing half the team by Friday.

I get why it happens. AI is moving quickly, capital is chasing it, buyers are confused, and everyone is scared of sounding late. So people reach for the biggest possible story.

But the businesses that actually benefit from AI will probably sound a lot less dramatic.

They will say things like: we reduced first-draft turnaround from two days to two hours. We increased useful sales follow-up without letting AI send externally on its own. We cut code review wait time on standard changes. We found which content themes create buyer conversations, not just impressions. We now have an audit log for agent runs. We stopped doing three manual reporting jobs every week. We tried automating this bit and decided not to, because the failure mode was ugly.

That last one matters. Real AI maturity includes knowing what not to automate.

For agencies and marketing teams, this is the bit I think gets missed.

AI is not going to transform marketing because teams can generate more stuff. Most teams already generate too much stuff. The useful shift is when the commercial system becomes faster, more responsive and more evidence-led.

That means the content machine cannot just ask, “What can we publish?”

It has to ask, “What did this help us learn? Which buyer did it attract? Which sales conversation did it support? Which objection did it answer? Which workflow got faster? Which claim can we now prove?”

This also changes how AI services should be sold.

“We build AI agents” is already a soft offer. So is “AI-powered marketing”. Both can mean almost anything, which usually means the buyer has to do too much translation.

A stronger offer is narrower and more inspectable: we turn this repeated process into a supervised AI workflow. We define the baseline before we automate. We add approval points where judgement matters. We log what the system does. We measure whether it improved the business. We report the tradeoffs, not just the wins.

Less sexy. Much more useful.

This is also a better story for employees and clients. The worst AI pitch is still “we are going to replace everyone and call it efficiency”. It might get a founder excited for five minutes. It also makes teams defensive, buyers nervous and regulators interested. Not a great combination.

The better pitch is: we are removing drag from the system, making work easier to inspect, and freeing people from the repeatable bits so they can spend more time on judgement, relationships and commercial decisions.

But again, you have to prove it.

If you say AI saves time, show where. If you say it improves quality, define quality. If you say it creates revenue, show the path from output to buyer action. If you say your agent is safe, show the permissions, logs and failure handling.

If you cannot prove it yet, say that. Then run the experiment properly.

That is probably the biggest practical shift for the next year of AI adoption. Not less ambition. Less bullshit.

The companies that win will not be the ones with the loudest AI claims. They will be the ones whose claims can survive contact with a client, a CFO, a sceptical operator, or a messy Monday morning when the workflow breaks.

That is the post-agency opportunity too.

Do not just make more output.

Build the operating layer. Build the proof layer. Show the work.

Because AI buyers are going to get more sceptical, not less.

And that is a good thing.

It means the grown-ups might finally get a look in.