EngineeringMay 15, 20266 min read

I stopped trusting the agents.

An agent said it shipped my landing page. It had not. This is the post I wrote at 1am after I figured out why, and what I changed.

By Benard. Building Otto.

Two weeks ago the Landing Builder told me it had deployed a page. I opened the URL. 404. I opened the logs. The build had failed at the very last step, and the agent had moved on anyway because it pattern-matched 'success' from an earlier log line.

I was tired, so I rage-wrote a rule and pushed it: no agent can mark a task done until it has independently re-observed the artifact. For the Landing Builder that means an HTTP 200 on the deployed URL, plus a Playwright check that the headline text is present. Not 'the deploy command exited 0'. The actual page, loaded, with the right content.

Since then I have not had a single false-success report. I've had failures (plenty), but the agent says 'I failed' instead of 'I succeeded'. That distinction is the entire game.

If you're building agents and you take one thing from this post, take this: prompts will not save you, evals will not save you, the only thing that saves you is a hard self-check at the end of every task that doesn't trust anything the model said about its own work.