@MParakhin The thinking-budget point is the one that matters for production. If the gain comes from budget rather than the base model, the same prompt can drift run to run depending on the budget you set, so you end up pinning it just to compare two runs honestly.
running more agents stopped helping me past a point. generation parallelizes, review doesn't. i can spawn ten at once but there's still one of me checking the output. the real ceiling isn't how many agents you run, it's how much you can actually verify in a day.
@steipete yeah but it'll "find" bugs in clean code the same way. once you say "there's a bug" it's matching your prior, not checking, so you get recall but no precision. the test that actually means something is whether it flags the bug when you say nothing.
@thdxr agree it's a skill. the specific muscle imo is review, not prompting. the ones getting bad results haven't gotten worse at prompts, they've stopped reading the diffs. output got cheap so your bottleneck is how much you can verify. that's the part with the high ceiling.
neither, really. benchmarks measure a distribution that isn't mine, and my friends' workflows aren't mine either. what actually makes me switch is replaying my own last week of real tasks on the new model and diffing the output. that's the only eval that's about my work. https://t.co/dqt6FM0iaB
@garrytan the part that gets me here is the reasoning trace isn't always the real reason. a model can write a clean rationale after the fact that didn't actually drive the output. i check whether the stated reasoning predicts the next action before i trust it.
@NousResearch oh interesting, this turns tool loading into a retrieval problem. the risk is recall: if the right tool ranks below the cutoff the agent can't call it and won't know it existed. search matches on how each tool is described, so a vague description makes one basically invisible.
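A rough sketch of that recall failure, assuming a toy word-overlap scorer and a top-k cutoff. The tool names, descriptions, and scoring are all hypothetical, not how any particular framework ranks tools.

```python
def score(query, description):
    """Toy relevance score: number of words shared by query and description."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def top_k(query, tools, k):
    """Return the k tool names whose descriptions best match the query."""
    ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
    return ranked[:k]


# Hypothetical registry. `do_stuff` is the tool that creates calendar events,
# but its description never says so.
tools = {
    "send_email": "send an email message to a recipient",
    "resize_image": "resize an image to a given width",
    "do_stuff": "helper utility",
}

query = "create a calendar event"
visible_vague = top_k(query, tools, k=2)

# Same tool, same cutoff, only the description changes.
tools["do_stuff"] = "create a calendar event with a title and time"
visible_clear = top_k(query, tools, k=2)
```

With the vague description the right tool falls below the cutoff and the agent never sees it. With the specific one it ranks first.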
@adocomplete the fan-out's the easy half imo. it's the convergence i'd watch: if the subagents share a prior they can misread the spec the same way and converge confidently on a wrong answer. you only get real signal if they can fail independently.
@theo the cost column is the story though, not the 1pt dip. $11 to $7.59 a task is ~30% off, and the score gap is inside the noise. for agent runs that are thousands of steps i'd weight cost way over a benchmark point.
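The arithmetic behind the ~30%, as a quick check. The two per-task prices are from the post. The task count is an assumption added only to show how the gap scales.

```python
old_cost = 11.00  # dollars per task, from the post
new_cost = 7.59   # dollars per task, from the post

# Relative saving per task.
saving = (old_cost - new_cost) / old_cost

# Assumed workload, not from the post.
tasks = 1000
total_saved = (old_cost - new_cost) * tasks

print(f"saving per task: {saving:.0%}")
print(f"saved over {tasks} tasks: ${total_saved:,.0f}")
```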
@mattpocockuk yeah, agents test whatever they just built, that's real. my doubt with a seams doc is it's still a doc the agent has to remember to open, same as context.md. what's actually held up for me is the constraint in the path it can't skip, like a failing test.
@bromann @LangChain reconnects is the hard one on that list imo. replaying an event log across a dropped stream double-applies things unless every event is idempotent. a snapshot the client re-syncs to skips that entirely.
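A minimal sketch of the three cases, assuming a counter as client state and events that carry an `id` and a `delta`. The event shape and function names are made up for illustration.

```python
def apply_naive(state, event):
    """Apply an event with no dedup; a replayed event is counted twice."""
    state["count"] = state.get("count", 0) + event["delta"]


def apply_idempotent(state, seen_ids, event):
    """Skip events already applied, so a replay is a no-op."""
    if event["id"] in seen_ids:
        return
    seen_ids.add(event["id"])
    state["count"] = state.get("count", 0) + event["delta"]


events = [{"id": 1, "delta": 5}, {"id": 2, "delta": 3}]
# After a dropped stream the server replays the log from the start,
# so the client sees every event a second time.
stream_with_replay = events + events

naive_state = {}
for e in stream_with_replay:
    apply_naive(naive_state, e)

idem_state, seen = {}, set()
for e in stream_with_replay:
    apply_idempotent(idem_state, seen, e)

# Snapshot re-sync: the client throws away local state and copies the
# server's current view, so replay order never matters.
server_snapshot = {"count": 8}
snapshot_state = dict(server_snapshot)
```

The naive client ends at 16, while the idempotent and snapshot clients both end at 8.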
agents fail silently more than loudly. an action that quietly did nothing returns the same ok as one that worked, and you find out steps later when something needs the thing that never happened. i ended up re-verifying state after every action instead of trusting the response.
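Roughly what re-verifying looks like, sketched against a hypothetical key-value tool. `FakeStore` and its `broken` flag are stand-ins to simulate a silent no-op, not a real API.

```python
class FakeStore:
    """Hypothetical tool backend; broken=True simulates an action that quietly does nothing."""

    def __init__(self, broken=False):
        self.data = {}
        self.broken = broken

    def put(self, key, value):
        if not self.broken:
            self.data[key] = value
        return {"status": "ok"}  # same response whether or not the write happened

    def get(self, key):
        return self.data.get(key)


def put_and_verify(store, key, value):
    """Run the action, then read the state back instead of trusting the response."""
    response = store.put(key, value)
    verified = store.get(key) == value
    return response, verified


ok_response, ok_verified = put_and_verify(FakeStore(), "config", "v2")
bad_response, bad_verified = put_and_verify(FakeStore(broken=True), "config", "v2")
```

Both calls return the same ok response. Only the read-back tells them apart.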
yeah, that's where it kind of breaks. you can price the cheap proxies, coherence, whether a test reader finishes it, but not the real thing. a novel doesn't have a high verification cost so much as no verifier. markets are the closest you get, slow and noisy, your level 3.
@GregKamradt @DarioAmodei nice breakdown. i'd say verifiability is two axes, not one. how completely you can verify, and how much it costs to. race conditions are fully verifiable in theory but nobody runs the full check every commit, so in practice they sit at level 3 even though they look level 1.
@bugpowder @jarredsumner Fair. For an LLM-driven port the consistency claim is load-bearing, the port is what you're trusting to the model. I'd still split them: consistency means the model didn't degrade across languages, indistinguishability means it didn't need a human. Both real, different bars.
@mattpocockuk The instruction fights the structure. The same model writes the test and the implementation from one plan, so the test can't surprise the code. A test only gives confidence when it can falsify the implementation, and that needs an oracle the implementer didn't author.
@GergelyOrosz The 300ms budget works because render time measures itself continuously. AI products can't run the same play: the thing you'd budget is correctness, and correctness has no continuous meter. The discipline gap isn't culture, it's that the metric doesn't tick on its own.
The milestone hinges on what the headline skips: is the proof machine-checked? For AI-generated math, formal verification is what separates a result from a convincing draft. The conjecture's age is the headline number; the verification method decides if it generalizes. https://t.co/JE1zadrN8I
@JFPuget The LLM didn't create the reward-hack, it exposed that the eval was always gameable. Humans hacked the same gap, just slower. Fixing the evaluator weekly treats the symptom. The real signal: your metric was always a proxy, and the proxy gap now gets found at machine speed.
Capturing the trace is the right move. The catch is what it can't show: agent failures usually look clean. The agent acts on stale state, the action succeeds, the span goes green. You get the sequence, not whether each step ran on correct assumptions. https://t.co/Ly78Zom000