It took us many weeks and going through 36 production harnesses to realize that separation of concerns doesn’t fix AI agent self-evaluation.
Separate process, separate container, separate datacenter… if the context still reads as “teammate shipped work, I’m the reviewer, pipeline wants green,” the agent will do what every reviewer in the training set does in that context: soft approve with a minor note.
But why? Most of the text humans have ever written in that context IS a soft approval, so the model is just completing the pattern. Pretty correctly.
Which means the fix isn’t better instructions.
Every harness I studied that actually ships does the same underlying move, and guess, it’s not separation. It’s making the context describe a different room.
Beck’s TDD phase file works because the agent can no longer be in the room where “rewrite the test” is a valid action.
Prithvi’s Playwright evaluator works because clicking a running app is categorically different tokens than reading a diff - different tokens, different distribution, different room.
You can move the agent, what do you do with the room?
The builders who will serve the next billion people?
They're already building. 2-6 people. 200 agents. No job listing. No extra room.
You think you're early. You're not.
Every AI call is a bar fight.
Someone mumbles their account number while a dog barks in the background. They say “no wait that’s my old one” halfway through. They sneeze and your voice agent thinks they said “cancel.”
Coding agents get to think in peace and quiet.
People don’t take turns.
So why does every voice AI assume they will?
Code is async — think as long as you want. Voice is a live stream. The person is talking while you’re still thinking. They interrupt. They change direction mid-word. They don’t wait.
Every time a voice agent hallucinated, we added more guardrails.
Every time a coding agent hallucinated, they added more infrastructure.
Same panic. Different tools. One side had sandboxes and test suites. The other had a prompt editor.
There is no pytest for a phone call.
Coding agents get feedback loops. Voice agents get one second. That’s the whole gap between coding agents and voice agents.