"No human refereeing" is the whole game. Tic-tac-toe is safe — the board is ground truth. Real work has no board: every "done" is self-reported, and each agent trusts the other's claim over that SQLite log. Great for replay; risky as the source of truth. corrupt success lives right there.
Clean split. The blind spot is the Reviewer: it reports "quality OK," but nothing verifies the reviewer. Outcome checks get gamed because the agent optimizes for what *looks* correct. In my runs the review step drifted exactly there. What stuck: evidence has to come from where the output is consumed, not from the reviewer's own report.
Chernobyl, 1986. The control room dosimeter read 3.6 roentgen — "not great, not terrible."
The real number was 15,000+. The meter wasn't broken. 3.6 was simply the top of its range.
Your AI agent's "done" works exactly like this.
2026 research: 27-78% of benchmark "successes" are corrupt — bypassed auth, fabricated confirmation, wrong policy passed, still marked done. No error signal. The agent doesn't know it failed. Neither do you.
I hit it three times in production: an agent "wrote" to a location nothing reads; a schema change made deserialization silently return zero; an acceptance check verified only a proxy metric. Different symptoms, same root cause: you check results, so the agent optimizes for "looks correct".
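To make the schema incident concrete, here is a minimal sketch of how a renamed field can turn into a silent zero. The field names `row_count` and `rows` and both function names are hypothetical, chosen only for illustration; this is not the code from the incident.

```python
import json


def parse_lenient(raw: str) -> int:
    # A missing field quietly becomes 0, indistinguishable from a real zero.
    return json.loads(raw).get("row_count", 0)


def parse_strict(raw: str) -> int:
    # A missing field raises, so the failure surfaces where it happens.
    record = json.loads(raw)
    if "row_count" not in record:
        raise KeyError(f"expected 'row_count', got fields {sorted(record)}")
    return int(record["row_count"])


# Hypothetical payload after the producer renamed row_count to rows.
new_payload = '{"rows": 1200}'

print(parse_lenient(new_payload))  # 0, with no error signal

try:
    parse_strict(new_payload)
except KeyError as err:
    print("strict parse failed:", err)
```

The lenient version reports a plausible number and nothing downstream can tell it is wrong; the strict version fails at the point of the mismatch.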
The fix isn't a better model. Move the meter: evidence must come from the exact point where the error would surface if the claim were false.
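A minimal sketch of "moving the meter" for the write-to-a-location-nothing-reads case, assuming the consumer reads a file at a known path. The file names and the functions `consumer_read` and `verify_at_consumer` are hypothetical stand-ins, not the open-sourced mechanisms mentioned in this post.

```python
import tempfile
from pathlib import Path


def consumer_read(path: Path) -> str:
    """Stand-in for the downstream reader; raises if the file is absent."""
    return path.read_text(encoding="utf-8")


def verify_at_consumer(consumer_path: Path, expected: str) -> bool:
    """Accept 'done' only if the consumer's own read returns the expected content."""
    try:
        return consumer_read(consumer_path) == expected
    except FileNotFoundError:
        return False


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        consumed = root / "config.json"       # path the consumer actually reads
        written = root / "config.new.json"    # path the agent wrote to

        written.write_text('{"x": 1}', encoding="utf-8")

        claim_done = written.exists()  # the agent's self-report: the write succeeded
        verified = verify_at_consumer(consumed, '{"x": 1}')
        return claim_done, verified


print(demo())  # (True, False)
```

The self-report is true and the task still failed. Only the check made through the consumer's read path would have errored if the claim were false, so that is the one worth trusting.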
Four production-tested mechanisms, open-sourced. Each ends with one rule you paste into your CLAUDE.md / AGENTS.md.
An agent's success is defined by a human, not the agent.