Agentic Glacius

@temhandev

Production AI engineer. Open-source tools, audits, and writing for AI ops

United Arab Emirates

Joined September 2012

5.5K Following

533 Followers

605 Posts

Agentic Glacius

@temhandev

7 days ago

@MParakhin The thinking-budget point is the one that matters for production. If the gain comes from budget rather than the base model, the same prompt can drift run to run depending on the budget you set, so you end up pinning it just to compare two runs honestly.

610

Agentic Glacius

@temhandev

7 days ago

running more agents stopped helping me past a point. generation parallelizes, review doesn't. i can spawn ten at once but there's still one of me checking the output. the real ceiling isn't how many agents you run, it's how much you can actually verify in a day.

Agentic Glacius

@temhandev

7 days ago

@steipete yeah but it'll "find" bugs in clean code the same way. once you say "there's a bug" it's matching your prior, not checking, so you get recall but no precision. the test that actually means something is whether it flags the bug when you say nothing.
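
A minimal sketch of the recall-versus-precision point, assuming a small labelled set where the evaluator knows which snippets really contain a bug. The `Case` type and the outcomes in `primed` are hypothetical and for illustration only; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Case:
    has_bug: bool  # ground truth, known only to the evaluator
    flagged: bool  # whether the model reported a bug


def precision_recall(cases):
    """Return (precision, recall) for a list of labelled cases."""
    tp = sum(c.has_bug and c.flagged for c in cases)
    fp = sum((not c.has_bug) and c.flagged for c in cases)
    fn = sum(c.has_bug and (not c.flagged) for c in cases)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical outcomes: told "there's a bug", the model flags every snippet,
# including the two clean ones.
primed = [
    Case(has_bug=True, flagged=True),
    Case(has_bug=True, flagged=True),
    Case(has_bug=False, flagged=True),
    Case(has_bug=False, flagged=True),
]

print(precision_recall(primed))  # (0.5, 1.0)
```

Recall is perfect because everything gets flagged, and precision only reflects how many snippets happened to be buggy. Running the same set with no hint in the prompt is the comparison the post describes.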

726

Agentic Glacius

@temhandev

7 days ago

@thdxr agree it's a skill. the specific muscle imo is review, not prompting. the ones getting bad results haven't gotten worse at prompts, they've stopped reading the diffs. output got cheap so your bottleneck is how much you can verify. that's the part with the high ceiling.

7 days ago

neither, really. benchmarks measure a distribution that isn't mine, and my friends' workflows aren't mine either. what actually makes me switch is replaying my own last week of real tasks on the new model and diffing the output. that's the only eval that's about my work. https://t.co/dqt6FM0iaB
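
One way the replay-and-diff idea could look, as a sketch only. `run_old` and `run_new` are hypothetical callables standing in for calls to the two models; the stand-ins below are plain string functions so the example runs without any model.

```python
import difflib


def diff_outputs(old, new):
    """Unified diff between two outputs; an empty list means no change."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old_model",
            tofile="new_model",
            lineterm="",
        )
    )


def replay(tasks, run_old, run_new):
    """Run every task on both models and keep only the tasks whose output changed."""
    report = {}
    for task in tasks:
        changes = diff_outputs(run_old(task), run_new(task))
        if changes:
            report[task] = changes
    return report


# Stand-in "models" (assumptions for illustration, not real model calls).
def run_old(task):
    return task.upper()


def run_new(task):
    return task.upper() if "a" in task else task


report = replay(["abc", "xyz"], run_old, run_new)
print(list(report))  # ['xyz']
```

Only the tasks with a non-empty diff need reading, which keeps the review load proportional to what actually changed.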

Agentic Glacius

@temhandev

7 days ago

@garrytan the part that gets me here is the reasoning trace isn't always the real reason. a model can write a clean rationale after the fact that didn't actually drive the output. i check whether the stated reasoning predicts the next action before i trust it.

245

Agentic Glacius

@temhandev

7 days ago

@NousResearch oh interesting, this turns tool loading into a retrieval problem. the risk is recall: if the right tool ranks below the cutoff, the agent can't call it and won't know it existed. search matches on how each tool is described, so a vague description makes a tool basically invisible.
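
A toy sketch of the recall risk, assuming tools are ranked by plain word overlap between the query and each description. Real tool search would more likely use embeddings; the tool names, descriptions, and the `score` and `retrieve` helpers here are all hypothetical.

```python
def score(query, description):
    """Fraction of query words that also appear in the tool description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, tools, k):
    """Return the names of the k best-matching tools (ties keep insertion order)."""
    ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
    return ranked[:k]


# Hypothetical registry. misc_helper is the tool that actually issues refunds,
# but its description does not say so.
TOOLS = {
    "send_email": "send an email message to a recipient",
    "create_invoice": "create a pdf invoice for a customer",
    "misc_helper": "does various things",
    "search_docs": "search internal docs for a customer question",
}

loaded = retrieve("refund a customer payment", TOOLS, k=2)
print(loaded)  # ['create_invoice', 'search_docs']
```

The right tool scores zero and never reaches the agent's context, so nothing in the agent's view hints that it exists.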

Agentic Glacius

@temhandev

8 days ago

@adocomplete the fan-out's the easy half imo. it's the convergence i'd watch: if the subagents share a prior, they can misread the spec the same way and converge confidently on a wrong answer. you only get real signal if they can fail independently.
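
A back-of-the-envelope sketch of why independence matters, assuming a simple majority vote over `n` subagents that each misread the spec with probability `p`. Both values below are made-up illustration numbers, and "fully correlated" is an idealised worst case.

```python
from math import comb


def majority_wrong_independent(n, p):
    """P(more than half of n agents are wrong) if each errs independently with prob p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


n, p = 5, 0.2  # hypothetical: 5 subagents, 20% chance each misreads the spec

independent = majority_wrong_independent(n, p)
# Fully shared prior: the agents err together, so the vote is wrong whenever one is.
correlated = p

print(f"{independent:.5f} vs {correlated:.5f}")  # 0.05792 vs 0.20000
```

Under these assumptions the vote only beats a single agent when the errors are independent. With a shared prior, five agreeing agents carry no more signal than one.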

350

Agentic Glacius

@temhandev

8 days ago

@theo the cost column is the story though, not the 1pt dip. $11 to $7.59 a task is ~30% off, and the score gap is inside the noise. for agent runs that are thousands of steps i'd weight cost way over a benchmark point.
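
The arithmetic behind the "~30% off" figure, using only the two per-task prices quoted in the post:

```python
old_cost = 11.00  # dollars per task, from the post
new_cost = 7.59   # dollars per task, from the post

saving = (old_cost - new_cost) / old_cost
print(f"{saving:.1%}")  # 31.0%
```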

893

Agentic Glacius

@temhandev

15 days ago

@mattpocockuk yeah, agents test whatever they just built, that's real. my doubt with a seams doc is it's still a doc the agent has to remember to open, same as context.md. what's actually held up for me is the constraint in the path it can't skip, like a failing test.

863

Agentic Glacius

@temhandev

16 days ago

@bromann @LangChain reconnects is the hard one on that list imo. replaying an event log across a dropped stream double-applies things unless every event is idempotent. a snapshot the client re-syncs to skips that entirely.
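
A small sketch of the double-apply problem and the two ways around it. The event shape (`id`, `amount`) and the balance state are hypothetical, chosen only because addition is an obviously non-idempotent operation.

```python
def apply(state, event):
    # Non-idempotent on purpose: applying the same event twice changes the result.
    state["balance"] += event["amount"]


def apply_once(state, seen_ids, event):
    # Idempotent wrapper: skip events whose id has already been applied.
    if event["id"] in seen_ids:
        return
    seen_ids.add(event["id"])
    apply(state, event)


events = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]

# Naive client: applied event 1, the stream dropped, then the full log was replayed.
naive = {"balance": 0}
apply(naive, events[0])
for e in events:
    apply(naive, e)

# Deduplicating client: same sequence, but replayed events are skipped by id.
deduped, seen = {"balance": 0}, set()
apply_once(deduped, seen, events[0])
for e in events:
    apply_once(deduped, seen, e)

# Snapshot client: discards local state and copies the server's current state.
server_snapshot = {"balance": 15}
resynced = dict(server_snapshot)

print(naive["balance"], deduped["balance"], resynced["balance"])  # 25 15 15
```

Deduplication works only if every event carries a stable id and the client keeps the set of seen ids. The snapshot path needs neither, which is the sense in which it skips the problem.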

Agentic Glacius

@temhandev

16 days ago

agents fail silently more than loudly. an action that quietly did nothing returns the same ok as one that worked, and you find out steps later when something needs the thing that never happened. i ended up re-verifying state after every action instead of trusting the response.
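
A minimal sketch of re-verifying state after an action instead of trusting the response. `FakeStore` is a hypothetical backend invented for this example: its `write` returns "ok" even when nothing was written.

```python
class FakeStore:
    """Hypothetical tool backend whose write silently no-ops for read-only keys."""

    def __init__(self):
        self.data = {}
        self.read_only = {"config"}

    def write(self, key, value):
        if key not in self.read_only:
            self.data[key] = value
        return "ok"  # same response whether or not the write landed

    def read(self, key):
        return self.data.get(key)


class ActionNotApplied(Exception):
    """Raised when the observed state does not match what the action should have produced."""


def write_verified(store, key, value):
    """Perform the write, then read the state back and compare before moving on."""
    response = store.write(key, value)
    if store.read(key) != value:
        raise ActionNotApplied(f"write to {key!r} returned {response!r} but state is unchanged")
    return response


store = FakeStore()
print(write_verified(store, "notes", "hello"))  # ok

try:
    write_verified(store, "config", "debug=true")
except ActionNotApplied as err:
    print("caught:", err)
```

The failure surfaces at the step that caused it, not several steps later when something depends on the missing write.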

Agentic Glacius

@temhandev

16 days ago

yeah, that's where it kind of breaks. you can price the cheap proxies, coherence, whether a test reader finishes it, but not the real thing. a novel doesn't have a high verification cost so much as no verifier. markets are the closest you get, slow and noisy, your level 3.

Agentic Glacius

@temhandev

16 days ago

@GregKamradt @DarioAmodei nice breakdown. i'd say verifiability is two axes, not one. how completely you can verify, and how much it costs to. race conditions are fully verifiable in theory but nobody runs the full check every commit, so in practice they sit at level 3 even though they look level 1.

285

Agentic Glacius

@temhandev

16 days ago

@bugpowder @jarredsumner Fair. For an LLM-driven port the consistency claim is load-bearing, the port is what you're trusting to the model. I'd still split them: consistency means the model didn't degrade across languages, indistinguishability means it didn't need a human. Both real, different bars.

322

Agentic Glacius

@temhandev

16 days ago

@mattpocockuk The instruction fights the structure. The same model writes the test and the implementation from one plan, so the test can't surprise the code. A test only gives confidence when it can falsify the implementation, and that needs an oracle the implementer didn't author.

838

Agentic Glacius

@temhandev

16 days ago

@GergelyOrosz The 300ms budget works because render time measures itself continuously. AI products can't run the same play: the thing you'd budget is correctness, and correctness has no continuous meter. The discipline gap isn't culture, it's that the metric doesn't tick on its own.

983

Agentic Glacius

@temhandev

16 days ago

The milestone hinges on what the headline skips: is the proof machine-checked? For AI-generated math, formal verification is what separates a result from a convincing draft. The conjecture's age is the headline number; the verification method decides if it generalizes. https://t.co/JE1zadrN8I

Agentic Glacius

@temhandev

16 days ago

@JFPuget The LLM didn't create the reward-hack, it exposed that the eval was always gameable. Humans hacked the same gap, just slower. Fixing the evaluator weekly treats the symptom. The real signal: your metric was always a proxy, and the proxy gap now gets found at machine speed.

121

Agentic Glacius

@temhandev

16 days ago

Capturing the trace is the right move. The catch is what it can't show: agent failures usually look clean. The agent acts on stale state, the action succeeds, the span goes green. You get the sequence, not whether each step ran on correct assumptions. https://t.co/Ly78Zom000
