24 hours of Fable 5 auditing the app I've spent 4 months building with Opus 4.5 and 4.6 - 553K LOC, ~4,500 tests.
I spec every feature. I run adversarial reviews and blind agent reviews on every implementation round.
It still found 10 P0s.
The worst: my AI code reviews were approving code they couldn't actually read. And 30 test failures were invisible because the gate command never ran them.
Code generation is solved. Verification isn't. I'm building for exactly that - verification that doesn't depend on trusting any single model.
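For anyone who wants to check for the same "invisible failures" problem, here is a minimal sketch of the idea: compare the tests that exist against the tests the gate actually executed. The function names and test IDs are hypothetical, and it assumes you can get both lists out of your own test runner.

```python
def unexecuted_tests(discovered, executed):
    """Return test IDs that exist in the repo but that the gate never ran."""
    return sorted(set(discovered) - set(executed))


def gate_verdict(discovered, executed, failed):
    """A gate is green only if every discovered test ran and none failed.

    All three arguments are iterables of test IDs (hypothetical format).
    Returns (is_green, missing_tests).
    """
    missing = unexecuted_tests(discovered, executed)
    is_green = not missing and not list(failed)
    return is_green, missing


if __name__ == "__main__":
    # Illustrative data only: the gate ran one suite and skipped another.
    discovered = ["api/test_auth", "api/test_billing", "ui/test_login"]
    executed = ["api/test_auth", "api/test_billing"]
    print(gate_verdict(discovered, executed, failed=[]))
```

The point of the design: "no failures reported" and "nothing failed" are different claims, so the gate has to prove coverage before its green means anything.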
@naman_jain28 @arvidkahl 100% - I run both now - a failing test first, then a second set of eyes on the diff from a reviewer that didn't write the code. Different failure modes.
After a few days of Fable access, I adjusted my workflows to maximize its capability. Now that I've had to revert to Opus and chase the same quality of results, my usage is higher than it was with the workflows I built around Fable orchestrating. Anyone else finding the same?
Hey Fable, why does my cat lick its toes? Model not available? Hmmm. Ok. /ultracode why does my cat lick its toes? Yes, I need a /codex:adversarial on that.
Inverse Aesop moral as well: a watchdog that never barks. I audited 4 months of my AI-written code with Fable this week and the biggest issue wasn't the bugs - it was that my multi-model AI reviews had been approving code they couldn't parse. 30 test failures were invisible because the gate command never ran them - green checkmarks the whole way. The loop compounds only if the review step actually reviews.
@clairevo The Fable design critique matched my experience too - until I started front-loading the design direction: described the aesthetic in a DESIGN.md first, then had it build an HTML-only prototype from that before any real code. Made a night-and-day difference for me.
@petergyang Ha! Well captured. For the Haiku card - maybe a glass-cannon speedster, low intelligence and mana - and a local LLM card as the free-to-play character you grind 40 hours to do what the party does in 5.
A second agent helps, but I've found that a second agent from a different model family is the real unlock. I run Claude + Codex against the same diff and each consistently catches real bugs the other misses - same-family reviewers share the same priors, so they share the same blind spots.
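A rough sketch of how the two-reviewer comparison could be tallied, assuming each reviewer's output has already been normalized into (file, line, issue) findings. The `Finding` type and the reviewer names are illustrative, not the actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One normalized review finding (hypothetical shape)."""
    file: str
    line: int
    issue: str


def merge_reviews(reviews):
    """Map each finding to the sorted list of reviewers that reported it.

    `reviews` is a dict of reviewer name -> set of Finding.
    """
    all_findings = set().union(*reviews.values())
    return {
        finding: sorted(name for name, found in reviews.items() if finding in found)
        for finding in all_findings
    }


def unique_to(report, reviewer):
    """Findings that only `reviewer` caught: the other reviewers' blind spots."""
    only = [f for f, names in report.items() if names == [reviewer]]
    return sorted(only, key=lambda f: (f.file, f.line))


if __name__ == "__main__":
    shared = Finding("billing.py", 42, "unchecked None")
    extra = Finding("auth.py", 7, "token never expires")
    report = merge_reviews({"claude": {shared}, "codex": {shared, extra}})
    print(unique_to(report, "codex"))
```

Tracking what each reviewer catches alone is what shows whether the second family is earning its cost.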
@danshipper @every Agreed - possibly the best code READING model. I pointed it at 4 months of Opus-written code - 553K LOC, ~4,500 green tests - and it found 10 P0s. It's pricey, so I'm using Fable for architectural audits and orchestration. It will be tough to change back on June 22.
Great write-up. I ran the inverse experiment this week: had Fable audit 4 months of code I built with Opus 4.5/4.6 - 553K LOC, ~4,500 tests. It found 10 P0s, including evidence of my prior AI reviewers approving code they couldn't actually parse. Same experience on speed and cost - slow, and it tore through my usage limits. The firepower was substantial though. TBD at sustained prices.