Insight / signal

Your AI will not improve unless the work leaves a trail

If your AI makes a mistake and nobody captures the correction, you have not built an AI system. You have built a recurring apology machine.

The most interesting AI story this week was not a new model.

It was a tax workflow.

That sounds boring. Good. Boring is usually where the useful stuff hides.

OpenAI published a case study about Tax AI, a system it built with Thrive Holdings and Crete’s network of accounting firms. The system helped prepare tax returns across a pilot of 7,000 returns. OpenAI says it saved practitioners about a third of their prep time, increased throughput by around 50%, and drafted returns with up to 97% accuracy.

Those numbers will get the attention. Fair enough. But the numbers are not the real lesson.

The real lesson is how the system improved.

It did not improve because someone sat there whispering better prompts into the machine. It did not improve because the model suddenly became magical overnight. It improved because the work left a trail.

Practitioners uploaded documents. The AI extracted fields. Accountants corrected what was wrong. The product captured what the AI predicted, what the human changed, and what eventually went into the filed return. Those corrections were not treated as one-off annoyances. They became structured evidence.
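To make "structured evidence" concrete, here is a minimal sketch of what one captured correction might look like. The case study does not publish a schema, so every name here (CorrectionRecord, field_name, and so on) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorrectionRecord:
    """Hypothetical record: one field on one document, seen three ways."""
    document_id: str
    field_name: str
    predicted: str             # what the AI extracted
    corrected: Optional[str]   # what the reviewer entered, or None if untouched
    filed: str                 # what went into the final return

    @property
    def was_corrected(self) -> bool:
        # True only when a human entered a value that differs from the prediction.
        return self.corrected is not None and self.corrected != self.predicted


record = CorrectionRecord(
    document_id="doc-001",
    field_name="wages",
    predicted="52000",
    corrected="52500",
    filed="52500",
)
print(record.was_corrected)  # True
```

The point of keeping all three values is that you can later ask which predictions were changed, and whether the change survived to the filed version.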

That is the bit most businesses miss.

They launch an AI tool, watch it get something wrong, patch the prompt, complain in Slack, and move on. A week later, the same mistake comes back wearing a different hat.

Nothing compounds because nothing is captured properly.

In the Tax AI example, repeated corrections became findings. Findings became eval targets. Codex could then inspect the trace, the repo, the extraction logic, the mapping rules, and the relevant skills. It could propose a fix. The system could run targeted evals and broader regression checks. Engineers still reviewed the change before anything shipped.
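A rough sketch of the "corrections become findings, findings become eval targets" step might look like this. It assumes each correction has already been tagged with a reason, which is an assumption of mine and not something the case study spells out. The function name, dictionary keys, and the min_repeats threshold are all hypothetical.

```python
from collections import defaultdict


def build_eval_cases(corrections, min_repeats=2):
    """Group corrections by (field, reason) and turn repeated groups into eval cases."""
    groups = defaultdict(list)
    for c in corrections:
        groups[(c["field"], c["reason"])].append(c)

    cases = []
    for (field, reason), items in groups.items():
        if len(items) < min_repeats:
            continue  # a single correction may be noise or preference, so skip it
        for item in items:
            cases.append({
                "document_id": item["document_id"],
                "field": field,
                "expected": item["corrected"],  # the human's value is the target
                "finding": f"{field}:{reason}",
            })
    return cases


corrections = [
    {"document_id": "doc-1", "field": "wages", "reason": "wrong_box", "corrected": "52500"},
    {"document_id": "doc-2", "field": "wages", "reason": "wrong_box", "corrected": "61000"},
    {"document_id": "doc-3", "field": "interest", "reason": "typo", "corrected": "120"},
]
cases = build_eval_cases(corrections)
print(len(cases))  # 2
```

The one-off "interest" correction is left out on purpose. Only the mistake that came back earns a test.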

That is a very different thing from “we use AI”.

That is an operating loop.

The shape is simple enough:

  • the AI does real work;
  • the workflow records what happened;
  • the human corrects it;
  • the system captures the correction;
  • repeated failures become tests;
  • the agent proposes a bounded fix;
  • the fix is validated before release;
  • humans keep authority over the parts that matter.
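The last three steps of that loop can be sketched as a simple release gate. This is an illustration under my own assumptions, not a description of how Tax AI is built. ProposedFix and can_release are made-up names.

```python
from dataclasses import dataclass


@dataclass
class ProposedFix:
    description: str
    targeted_evals_passed: bool    # evals built from the repeated failure
    regression_evals_passed: bool  # the broader suite, to catch side effects
    human_approved: bool           # a reviewer has signed off


def can_release(fix: ProposedFix) -> bool:
    """A fix ships only when every gate passes. Human approval is not optional."""
    return (
        fix.targeted_evals_passed
        and fix.regression_evals_passed
        and fix.human_approved
    )


fix = ProposedFix(
    description="Tighten a field mapping rule",
    targeted_evals_passed=True,
    regression_evals_passed=True,
    human_approved=False,
)
print(can_release(fix))  # False
```

Green evals on their own do not ship anything. That is the "humans keep authority" part written down as a rule and not left as a habit.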

None of that is glamorous. It is logging, product design, evals, review queues, approvals, versioning, rollback, and a fair bit of grown-up operational hygiene.

Which is exactly why it matters.

Most AI projects inside normal companies are still stuck at the assistant layer. Ask it to draft a thing. Ask it to summarise a thing. Ask it to answer a question from the docs. Useful, but fragile. When it fails, the failure often disappears into chat history, a copied output, or someone’s private frustration.

The business learns nothing.

That is the expensive part. Not the occasional wrong answer. The repeated wrong answer.

A better model might reduce the error rate. Fine. Use the better model. But if your workflow does not preserve evidence, you are still depending on luck and human memory. You cannot tell whether the system is improving. You cannot tell which failures matter. You cannot tell whether last week’s fix made this week’s edge case worse.
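Checking whether last week's fix made something else worse does not need much machinery once eval results are stored. A minimal sketch, assuming each run is saved as a mapping from a case id to pass or fail (the case ids and the function name are hypothetical):

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return eval case ids that passed before the change and fail after it.

    A case missing from the later run is treated as failing, so a dropped
    test cannot quietly hide a regression.
    """
    return sorted(
        case_id
        for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )


last_week = {"case-1": True, "case-2": True, "case-3": False}
this_week = {"case-1": True, "case-2": False, "case-3": True}
print(find_regressions(last_week, this_week))  # ['case-2']
```

Here the fix repaired case-3 and broke case-2. Without the stored results, you would only notice the first half.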

You are just vibing in production, which is a brave way to run payroll, tax, marketing, sales, customer support, or anything with a client on the other end.

The same pattern showed up elsewhere this week.

OpenAI’s Cisco case study was all about Codex inside serious engineering workflows: large repos, C/C++ codebases, compile-test-fix loops, security controls, governance, and human review. Warp talked about agent orchestration needing observability, coordination, memory, reproducible environments, and review. Remote told TechCrunch it grew revenue per employee by 50% after pushing AI across the company, with internal tools and custom workflows rather than a few executives playing with chatbots.

Different stories. Same direction.

The value is moving from “AI can generate output” to “AI is embedded inside a workflow that can be observed, corrected, tested, and improved”.

That distinction matters for marketing and agency work more than people think.

A lot of AI marketing still amounts to faster drafting. More posts. More emails. More campaign ideas. More versions of the same beige nonsense, now produced at industrial speed. If the only thing AI gives you is more output, you have not built much of an advantage. You have built a bigger slop tap.

The useful marketing system is different.

It remembers which source claims were approved. It records which hooks got edited and why. It tracks which buyer language came from real calls rather than made-up persona theatre. It captures when a sales follow-up missed the point. It logs why a campaign angle was rejected. It turns those corrections into better procedures.

Not just better prompts.

Better procedures.

Take a content workflow. The weak version is obvious: “Write me a LinkedIn post in our brand voice.” Maybe you add a style guide. Maybe you tell it not to sound like a conference brochure because you’re not a monster. The output improves a bit. Then a new person uses the workflow next week and the same unsupported claim slips back in.

The evidence-loop version asks better questions.

Where did the claim come from? Was it approved? Which part was edited by a human? Was the edit about tone, accuracy, buyer relevance, or legal risk? Did the final post perform with the audience it was meant for? Did sales use it? Did it generate a useful conversation, or just vanity approval from people who will never buy?
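Even the "what was the edit about?" question can be captured with very little. Here is a sketch, assuming editors tag each change with one of the four reasons named above. The EditReason enum, the post ids, and summarise_edits are illustrative names, not part of any real tool.

```python
from collections import Counter
from enum import Enum


class EditReason(Enum):
    TONE = "tone"
    ACCURACY = "accuracy"
    BUYER_RELEVANCE = "buyer relevance"
    LEGAL_RISK = "legal risk"


def summarise_edits(edits):
    """Count how often each reason appears across logged human edits."""
    return Counter(reason for _, reason in edits)


# Each entry is (content id, reason the human gave for the edit).
edits = [
    ("post-14", EditReason.ACCURACY),
    ("post-15", EditReason.TONE),
    ("post-16", EditReason.ACCURACY),
]
summary = summarise_edits(edits)
print(summary[EditReason.ACCURACY])  # 2
```

If accuracy edits keep topping the count, the problem is probably the source material or the approval step, not the brand voice prompt.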

That is where AI starts becoming commercially useful.

Same with sales follow-up. Same with support bots. Same with reporting. Same with proposal generation. Same with campaign planning.

If the workflow captures the correction, you can improve the system. If it does not, the correction dies in someone’s head.

This is also where the post-agency model starts to become real.

The old agency sold outputs: decks, campaigns, copy, creative, reports, ads. Some of that still matters. Taste still matters. Strategy still matters. Distribution definitely matters. But output on its own is getting cheaper by the week.

The next agency does not win by promising more drafts.

It wins by building the operating layer around the work: research loops, campaign loops, content QA loops, sales response loops, reporting loops, support loops. Then it improves those loops from evidence.

That is a better offer for a business owner because it does not pretend AI is a magic employee. It treats AI like part of a system.

And systems need management.

Here is the buying question I would ask any AI supplier now:

When the AI gets something wrong, what happens next?

Not in a hand-wavy way. Specifically.

Can you show me the trace? Can you show me the correction? Can you separate a real failure from human preference or ordinary workflow noise? Can you group repeated mistakes? Can you create an eval from them? Can you test a fix before release? Can a human reject it? Can you roll it back?

If the answer is mostly vibes, you are not buying an AI system. You are buying a demo with a support burden attached.

This is why I am less interested in the phrase “self-improving agent” than in the machinery underneath it. The phrase sounds like the agent goes away, thinks very hard, and becomes better by sheer digital willpower. That is not what the useful examples are showing.

The useful examples are much more practical.

The AI does the work. Humans correct the work. The product captures the correction. The system turns patterns into tests. Agents help investigate and propose fixes. Humans decide what ships.

Less sci-fi. More ops.

That is a good thing.

Because if AI is going to sit inside real businesses, it has to be more than impressive. It has to be inspectable. It has to improve without quietly breaking something else. It has to know when a case is ambiguous and route back to a person instead of forcing a confident answer.
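Routing ambiguous cases back to a person can start as something this small. It assumes the system produces a confidence score you can trust, which is a big assumption in its own right, and the 0.8 threshold is arbitrary. The function name and return values are hypothetical.

```python
def route(prediction: str, confidence: float, threshold: float = 0.8) -> str:
    """Accept confident predictions; send everything else to a human queue."""
    if confidence >= threshold:
        return f"accept:{prediction}"
    return "human_review"


print(route("52500", confidence=0.95))  # accept:52500
print(route("52500", confidence=0.41))  # human_review
```

The threshold is a business decision, not a technical one. Set it by asking what a confident wrong answer costs in that workflow.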

This is the bit business owners should pay attention to now.

Not which model won the benchmark this week. Not which vendor has the loudest launch video. Not whether someone has found the perfect prompt.

Ask whether the work leaves a trail.

Because if it does, your AI can start learning from the business.

If it does not, you are just renting intelligence with amnesia.


Pull quotes

  • If your AI makes a mistake and nobody captures the correction, you have not built an AI system. You have built a recurring apology machine.
  • The business learns nothing when the correction dies in someone’s head.
  • The value is moving from AI output to AI workflows that can be observed, corrected, tested, and improved.
  • A better model might reduce the error rate. An evidence loop stops the same mistake coming back every week.
  • Output is cheap. Learning loops are the asset.