The AI moat is the correction loop, not the chatbot

A prompt is often just vibes in a trench coat. A skill is closer to procedure.

Everyone can buy the same models.

That is basically true now and it is only going to get more true. The gap between what OpenAI offers and what Anthropic offers and what Google offers is closing fast. A small business and a large enterprise can both log into the same API, pay roughly the same per token, and access roughly the same raw capability.

So if everyone has the same models, where does the advantage actually come from?

I think it comes from the correction loop.

Not the chatbot. Not the fancy prompt. Not the agent with twelve tools and a name. The loop that runs after the AI gets something wrong, captures what happened, and turns that failure into a tested improvement.

The correction loop is the moat.


I wrote yesterday about OpenAI’s tax case study — how Crete’s system improved because corrections became evidence, evidence became evals, and evals gave Codex something real to fix. Worth reading if you missed it. But the tax workflow is the example. The pattern is the point.

The pattern is this: AI systems become genuinely valuable when they have a mechanism for learning from real commercial work. Not from benchmarks. Not from training runs. From the actual corrections made by actual people doing actual jobs.

That mechanism has a shape.

The AI does something. A human reviews it. The human corrects it. The system captures the correction, not as a one-off complaint but as structured evidence. Repeated failures become test cases. Test cases become engineering targets. Fixes are validated against those tests before they ship. The human stays in authority over the parts that matter.
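To make that shape concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Correction` record, the `failure_tag` label, the `min_repeats` threshold, and the exact-match check are hypothetical names and choices, not how any particular system works.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One human correction, captured as structured evidence (hypothetical schema)."""
    task: str         # what the AI was asked to do
    ai_output: str    # what it produced
    human_fix: str    # what the reviewer changed it to
    failure_tag: str  # the reviewer's label for what went wrong


def promote_to_test_cases(corrections, min_repeats=2):
    """Turn failures seen at least `min_repeats` times into test cases."""
    counts = Counter(c.failure_tag for c in corrections)
    return [
        {"input": c.task, "expected": c.human_fix, "tag": c.failure_tag}
        for c in corrections
        if counts[c.failure_tag] >= min_repeats
    ]


def passes_all(candidate, test_cases):
    """A fix ships only if the candidate passes every captured case.

    Exact string match is a simplification; real evals would usually
    need a looser or rubric-based comparison.
    """
    return all(candidate(tc["input"]) == tc["expected"] for tc in test_cases)
```

The point of the sketch is the order of operations: the correction is stored first, repetition decides what becomes a test, and the test decides what ships.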

That is a correction loop.

Most businesses have nothing like this. They have a prompt. Maybe a knowledge base attached to a chatbot. When the AI gets something wrong, someone complains in Slack, someone edits the prompt, and the same mistake walks back in wearing a different coat two weeks later.

Nobody captures the failure properly. Nobody turns it into a test. Nobody checks whether last week’s fix broke something else. The system does not improve because the improvement mechanism does not exist.


There is a useful idea from research I came across last week that sharpens this.

A Microsoft paper on self-evolving agent skills, SkillOpt, makes a distinction I think matters. Instead of trying to make the model itself learn your business (which most businesses cannot practically do), you maintain external skill files: compact, versioned, tested documents that tell an agent how to perform a repeated task. When corrections surface a pattern, the skill gets updated. When a proposed update fails evals, the failure becomes negative evidence. The skill gets better because the loop gives it something real to improve from.

The difference between a prompt and a skill sounds subtle. It is not.

A prompt is often just vibes in a trench coat. A series of instructions shaped by taste and guesswork, updated informally, forgotten next quarter when the person who wrote it leaves. Most prompts are not reviewed. They are not versioned. They are not tested against known cases. They are written, used, and occasionally blamed.

A skill is closer to procedure. It can be reviewed. Versioned. Tested against specific cases. Rolled back when a change makes things worse. Improved when the work exposes a recurring failure. Shared across agents doing similar tasks.
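A rough Python sketch of what "versioned, tested, rolled back" could look like. This is an illustration under stated assumptions, not the paper's method: the `Skill` class, its `propose` and `rollback` methods, and the idea of evals as simple checks on the skill text are all hypothetical. In practice an eval would run the agent with the skill and score its output.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A versioned instruction document behind an eval gate (illustrative only)."""
    name: str
    versions: list = field(default_factory=list)  # accepted texts, oldest first
    rejected: list = field(default_factory=list)  # failed proposals, kept as negative evidence

    @property
    def current(self):
        """The latest accepted version, or None if nothing has been accepted."""
        return self.versions[-1] if self.versions else None

    def propose(self, new_text, evals):
        """Accept `new_text` only if every eval passes; otherwise record why it failed."""
        failed = [label for label, check in evals.items() if not check(new_text)]
        if failed:
            self.rejected.append((new_text, failed))
            return False
        self.versions.append(new_text)
        return True

    def rollback(self):
        """Drop the latest version, always keeping at least one."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current
```

Note that a rejected proposal is not thrown away. It is stored alongside the names of the evals it failed, which is the "negative evidence" part of the loop.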

That is a meaningfully different thing to run inside a business.


Here is where this gets uncomfortable for the agency model.

The old agency sold output. Campaigns, pages, decks, ads, copy, reports. You hired them, they produced things, you paid, the relationship repeated.

The lazy AI version of that model sells cheaper output. More campaigns. More pages. More decks. More emails. All produced at industrial speed by people who have figured out how to move faster with the same tools. Fantastic. The internet definitely needed more average content.

The next useful model is different.

It builds the operating layer that makes commercial work improve.

That might mean a campaign system where every client edit teaches the next brief. A sales-support agent where every rejected proposal tightens the qualification skill. A customer-service layer where every corrected answer becomes a test case before it is allowed back into production. A content engine where human taste is not sitting in private comments but encoded into reusable review rules that the next run inherits.

Not “the AI does your marketing”.

More like: your marketing system learns which claims survive review, which sources are trusted, which offers convert, which objections keep coming back, and which workflows waste everyone’s Tuesday afternoon.


There is a trust dimension here as well.

Anthropic shipped Claude 4 this week with a meaningful focus on honesty — specifically, the model flagging when it is operating in a harder mode or when it cannot do something well. OpenAI published a Frontier Governance Framework. The direction of travel is clear.

The market is moving away from “look how powerful the model is” towards “show me how the system behaves, how it fails, and what you do about it”.

That matters at the frontier level. It also matters at the normal business level.

If AI is touching your sales process, your customer support, your financial documents, your website claims, or your client delivery, there are five questions worth asking:

What did it do? Why did it do that? Who approved it? What happened when it was wrong? Has that failure been turned into a test, or will you enjoy it again next Tuesday?
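One way to keep those questions answerable is to store one record per AI action with a field for each. The Python sketch below is a hypothetical schema: `TraceRecord`, its field names, and `open_questions` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRecord:
    """One AI action, with a field for each of the five questions (hypothetical)."""
    action: str                         # What did it do?
    rationale: str                      # Why did it do that?
    approved_by: Optional[str] = None   # Who approved it?
    correction: Optional[str] = None    # What happened when it was wrong?
    test_case_id: Optional[str] = None  # Has that failure been turned into a test?


def open_questions(record):
    """List the questions this record cannot yet answer."""
    gaps = []
    if not record.approved_by:
        gaps.append("no approver recorded")
    if record.correction and not record.test_case_id:
        gaps.append("corrected but not yet a test")
    return gaps
```

A record that was corrected but has no `test_case_id` is exactly the failure you will enjoy again next Tuesday.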

These are not bureaucracy questions. They are the difference between an AI system that compounds over time and one that just costs money while being blamed in Slack.


The companies that win with AI will not just be the ones with the best model subscription.

They will be the ones that build correction loops connected to real commercial work. Business-specific skills that improve from real failures. Evals that reflect actual taste, not assumed taste. Memory, rollback, and approval systems that let humans stay in authority without becoming bottlenecks.

That is a harder thing to build than a chatbot with a good personality.

Which is probably why it will still matter in two years, when the chatbot is a commodity and everyone has the same models and the same tools and the same token pricing.

The moat is not the model.

The moat is the machinery around the model: corrections, skills, evals, traces, approvals, rollback, and the boring discipline to treat every failure as information instead of embarrassment.

That is what I keep coming back to for Foundry.

Not how to help clients generate more. How to build them systems that learn.