look, small study: n=5, one repo, directional. but all 210 diffs are public so you can break it yourself without paying for a single call.
here 👇 https://t.co/qS0iawB9ae
if you find where it falls apart, i owe you one.
took the best model that exists, cranked it to max reasoning, gave it a coding task.
build green. tests passing. agent says "done."
and it was wrong 2 out of 3 times.
nobody would have noticed. that's the trap.
it's that an LLM isn't deterministic even at temp 0. every run is a roll, like me on a sharp day vs a foggy one.
the good run and the broken one look the same. you only tell them apart by executing. green can't tell you which one you got.
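a minimal sketch of what "telling them apart by executing" could look like. this is not the harness from the study: `execution_gate`, its arguments, and the failure labels are hypothetical names picked for illustration. it runs the built artifact and treats a crash, empty output, or wrong output as a failure, no matter what the build said.

```python
import subprocess
import sys


def execution_gate(cmd, expected_substring, timeout=10):
    """Run a built artifact and classify the outcome.

    Hypothetical helper: the name, arguments, and labels are assumptions
    made for this sketch, not part of any published harness.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "fail: timeout"
    except OSError:
        # The command could not be started at all.
        return "fail: crash"

    if result.returncode != 0:
        return "fail: crash"

    output = result.stdout.strip()
    if not output:
        return "fail: empty output"
    if expected_substring not in output:
        # Ran cleanly and printed something, but not what the spec asks for.
        return "fail: wrong output"
    return "pass"


if __name__ == "__main__":
    # Stand-in for "the thing the agent built": a one-line Python program.
    cmd = [sys.executable, "-c", "print('total=42')"]
    print(execution_gate(cmd, "total=42"))
```

the point of the design: the gate never looks at the build status, only at what the program does when it runs against an expectation taken from the spec.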
@trikcode the code was clean, that's the trick. it's not a quality problem. the model's locally correct every step, but nobody's keeping the architecture coherent across sessions. it optimizes the next token, not the system. that's the rewrite trap, not a skill issue.
@thsottiaux the top complaints aren't model stuff, they're state. worktrees getting lost, remote dropping, big files nuking context. model's fine. it's that every task starts from zero and nothing carries between sessions. that's where it actually hurts.
@irl_danB naming it matters but not for philosophical reasons. if everything's "vibe coding" you can't measure which part failed. context, spec, model, verification. you need the subdivisions or there's nothing specific to improve.
@mfishbein honestly extraction is the easy half. the hard part is turning that context into specs the agent can actually fail against. if it's not formalized you've still got an engineer reviewing every diff by hand. that's the bit that doesn't scale.
@CopyRebeldia measured this. without a spec, 6 AI CLIs fail genuinely (crash, empty output, silently wrong) 37% of the time. and everything compiles green btw, that's the trap. spec + real execution check drops it to 4%. praying was never a methodology lol. data: https://t.co/bGYS6tEqf9
A green build is not a correct feature.
I gave 6 AI coding CLIs the same tasks. All compiled. ~37% of raw runs shipped a genuine bug: a crash, an empty response, or nothing at all.
A spec + a verification gate dropped it to 4%. 🧵
@mardehaym "confidently wrong" nailed it. i benchmarked 6 AI CLIs, same task: 37% of runs ship a genuine bug but compile clean. green across the board. no spec + no execution gate = russian roulette. with both drops to 4%. data here: https://t.co/bGYS6tEqf9
Honest: n=5, one repo, directional, not a paper. Interface nitpicks excluded; only real correctness bugs counted.
Full breakdown + every number:
→ https://t.co/1Qq2MBzTFZ
The part I didn't expect: a spec + gate makes cheap models safe.
Kimi raw: 1/3. Kimi + spec: 3/3, at ~$0.03/task.
Sonnet raw: 2/5, at $0.45.
A model 10-50x cheaper, gated, beats the frontier model run on trust. You verify instead of trust.
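One way to put those numbers side by side is cost per passing run. This is a back-of-envelope sketch, not a figure from the study: it assumes the dollar amounts are per attempt, that a failed attempt is simply retried at the same price, and that the observed pass counts hold (with samples this small they are directional at best). `cost_per_pass` is a hypothetical helper.

```python
def cost_per_pass(cost_per_run, passes, runs):
    """Average spend needed to get one passing run.

    Hypothetical helper: assumes every attempt costs the same and that
    the observed pass rate (passes / runs) carries over to retries.
    """
    if passes == 0:
        return float("inf")  # never passes, so no finite cost
    return cost_per_run * runs / passes


# Figures quoted above: Kimi + spec 3/3 at ~$0.03, Sonnet raw 2/5 at $0.45.
kimi_gated = cost_per_pass(0.03, passes=3, runs=3)
sonnet_raw = cost_per_pass(0.45, passes=2, runs=5)

print(f"Kimi + spec: ${kimi_gated:.3f} per passing run")
print(f"Sonnet raw:  ${sonnet_raw:.3f} per passing run")
print(f"price ratio per attempt:     {0.45 / 0.03:.1f}x")
print(f"cost ratio per passing run:  {sonnet_raw / kimi_gated:.1f}x")
```

Under these assumptions the gap per attempt is about 15x, and it widens to about 37.5x once failed runs are priced in. Both land inside the 10-50x range quoted above.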