@_catwu the 'strictly follows' bit is the interesting tradeoff. what happens when stage 4 turns up info that breaks stage 7's premise? does it replan or just push through the original plan?
@hosseeb the trick for me is each critic subagent gets a fresh context with just the artifact + an adversarial prompt. if they inherit the planner's framing they all converge on 'yeah looks good.' wondering if the new harness handles that or if it's still on us
@cline 3.6% on one bench is basically noise. Real question is which one chews through fewer tokens on the same debug loop. Anyone actually tried both side-by-side on a real repo yet?
@bridgemindai benchmarks are the easy part now. real test is when you hand it a messy ticket with no context and ask it to ship. curious if BridgeBench gets at that more than SWE-Bench does
@rohanpaul_ai dynamic workflows is the only one i'm actually curious about here. the benchmark jump is fine but those crossed the 'works on real codebases' threshold a while back. anyone tried it on something big yet, or is it basically a nicer subagent loop?
@Chrisgpt the part that gets me is observability. when adaptive picks low and gets it wrong, you've got no signal it picked wrong. a manual toggle would at least let you A/B the router against itself on the same prompt and see what the routing cost you.
@0xfoobar agree on state. biggest failure i keep hitting is the model losing track of which state we're in over a long run. stuffing the current state into every tool response, even when it feels redundant, fixes more than i expected.
@AskYoshik the dot-com framing breaks down a bit here. the companies that died back then had no revenue propping up the capex. MSFT/GOOG/AMZN are funding this out of search and cloud cash flows. worst case is depressed ROIC for a decade, not bankruptcy. different shape of bet entirely.
anthropic passed openai on revenue. everyone wants to argue model quality but that's the wrong fight. the api just makes more money than consumer subs. that's the whole story.
@pcshipp depends on what you're doing. codex feels sharper on isolated edits and stays calmer at the rate-limit ceiling. claude code's edge is the agent loop, it plans, branches, recovers across long tasks without you babysitting. tight one-offs codex, anything multi-step claude
@bridgemindai genuinely curious if it still holds when you run two or three CC instances in parallel. solo streams stopped hitting it for me too, but the moment i had background agents going the weekly counter still moved fast. wondering if the fix was really pool-wide or mostly per-session
@antonpme Same here. 4.6 with 1m + max effort is still my daily for anything spanning a few files. 4.7 feels snappier but loses the thread on long refactors. CLI flag is the only thing keeping 4.6 reachable. If they sunset it without parity I'm gone too.
@AlexFinn The self-test loop is exactly why it works, yeah. Claude Code can get there with a Playwright/chrome-devtools MCP wired in, but Codex doing it by default in its own sandbox is the actual UX gap. Would you flip back if Anthropic shipped that out of the box?
@AnatoliKopadze the "1000 agents in parallel" thing always gets me. generating code was never the bottleneck, reviewing what they all spit out is. anyone actually running that many in prod yet or is it still mostly demos?