the cost of shipping code went to zero
taste didn't
but "taste" sounds mystical and unfixable, so nobody teaches it. here's the unmystical version: taste is just an eval you haven't written down yet
how you choose what to measure is what matters
1/8
What does it cost to evaluate 100% of our agent runs?
If you're running an LLM-as-judge, the number that comes back is high enough that you end up sampling 10% and moving on.
But sampling 10% doesn't really make evaluation cheaper. The 90% you skipped still has broken outputs in it, you just don't see them until a user does. That's still a cost. It's not on the bill, it just shows up later on a different line, and nobody traces it back to the sampling call.
Here's what else doesn't show up on that bill: the engineer-months to build the pipeline, the time to keep it from rotting every time prompts change, the drift checks, the trace storage, the compliance audit when someone asks what you actually evaluated, the observability vendor charging you per span. Add it all up and "what evals cost" stops being a token line and becomes a total-cost-of-ownership number. Usually a much bigger, much more uncomfortable one.
So we built a calculator that counts all these parameters. Put in your volume, your sampling rate, and what one bad incident costs you, and it shows your real annual cost end to end - then runs the same all-in math across every option: frontier judge, self-host, human review, and our own eval model (TURING), which comes out around 99% cheaper per call.
The honest surprise for most people: evaluating 100% of your traffic costs less than you think. Don't take the number from me though, run it on your own volume and sampling rate, frontier judge vs TURING side by side - https://t.co/7jpwnFqkil
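For anyone who wants to sanity-check the shape of that all-in math before opening the calculator, here is a minimal sketch. It is not the calculator's actual model: the field names, the formula, and every number are assumptions for illustration, and it simplifies by assuming every evaluated run catches its failure while every skipped failure reaches a user.

```python
from dataclasses import dataclass


@dataclass
class EvalCostInputs:
    # All fields are hypothetical inputs, not the calculator's real parameters.
    runs_per_year: int
    sampling_rate: float       # fraction of runs sent to the judge, 0..1
    judge_cost_per_run: float  # dollars per evaluated run
    fixed_annual_cost: float   # pipeline upkeep, trace storage, audits, vendor fees
    failure_rate: float        # fraction of runs with a broken output
    incident_cost: float       # dollars per broken output a user sees first


def annual_cost(x: EvalCostInputs) -> dict:
    """Split the yearly cost into the visible bill and the deferred cost."""
    judge_bill = x.runs_per_year * x.sampling_rate * x.judge_cost_per_run
    # Assumption: failures in unsampled runs are never caught internally.
    missed_failures = x.runs_per_year * (1 - x.sampling_rate) * x.failure_rate
    deferred = missed_failures * x.incident_cost
    return {
        "judge_bill": judge_bill,
        "fixed": x.fixed_annual_cost,
        "deferred": deferred,
        "total": judge_bill + x.fixed_annual_cost + deferred,
    }


# Made-up example numbers: same workload at 10% sampling and at 100%.
sampled = annual_cost(EvalCostInputs(1_000_000, 0.10, 0.02, 100_000, 0.01, 5.0))
full = annual_cost(EvalCostInputs(1_000_000, 1.00, 0.02, 100_000, 0.01, 5.0))
print(sampled)
print(full)
```

With these made-up inputs the judge bill is lower at 10% sampling but the total is not, because the deferred term dominates. Whether that holds for you depends entirely on your own failure rate and incident cost.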
@HenryL_AI Given the reasons weak models don't improve as much, it'd be interesting to see if we can train small models to leverage the harness better. Since that's already the case for models trained for agentic tool calling, maybe an SLM trained on harness use/tool calling is ideal for CS agents or similar
@trajectorylabs Pretty cool stuff - though determining success/failure for external-facing agents seems difficult. A customer leaving a negative review because they didn't get the refund they wanted probably can't be classified as a model failure. LLM to analyze the traces?
Very good advice on self-improving agents.
(bookmark it)
This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks.
What I have found is that stronger models do not always evolve better agents.
The current belief in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat.
New research shows that intuition is mostly wrong.
The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left.
This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side.
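A minimal sketch of what keeping the two seats separate can look like in a harness. This is not the paper's method: the loop, the type aliases, and the stub functions are all hypothetical, and real model calls would replace the stubs.

```python
from typing import Callable, List

# Hypothetical roles: the solver attempts a task with the current skill bank,
# the evolver reads the attempt and proposes one skill edit.
Solver = Callable[[str, List[str]], str]
Evolver = Callable[[str, str], str]


def run_evolution(tasks: List[str], solver: Solver, evolver: Evolver) -> List[str]:
    """Run each task with the solver, then let the evolver extend the skill bank."""
    skills: List[str] = []
    for task in tasks:
        attempt = solver(task, list(skills))  # copy so the solver cannot mutate the bank
        skills.append(evolver(task, attempt))
    return skills


# Stubs standing in for model calls. Per the post, the solver seat is where
# the expensive model would go and the evolver seat can take a cheap one.
def stub_solver(task: str, skills: List[str]) -> str:
    return f"attempt at {task} using {len(skills)} skills"


def stub_evolver(task: str, attempt: str) -> str:
    return f"skill learned from {task}"


skill_bank = run_evolution(["fix bug", "add test"], stub_solver, stub_evolver)
print(skill_bank)
```

Because the two roles are plain callables, swapping the model behind either seat is a one-line change, which makes the cheap-evolver, strong-solver split easy to try on your own tasks.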
Paper: https://t.co/8kJwR7NhmV
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
You can find the paper here: https://t.co/mlbWeqwNdo
By Nanyang Technological University and Zhejiang University
Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, and Yang Liu
We see it in 3rd-party repos/skills, but it makes me wonder how long until it's a default in code agents.
CODESKILL (2605.25430) provides a method to automatically build and maintain a compact skill bank that boosts coding agent performance - no manual prompt engineering.
@serafimcloud Love the UI - will try it myself. If you want to collab, happy to give you an API that drives https://t.co/smUibuZO54 - it backfills all arXiv data daily, but doesn't have the updated benchmark data you have. Looking for an alternative after PwC
@Vtrivedy10 agreed - but imo the problem is not so much running the models on every arxiv paper (been there, done that) - but exactly the "connecting the dots" from high impact past the training cutoff... the model can't granularly connect paper X to product Y
@Vtrivedy10 Really cool concept - maybe if proprietary models grow in cost (which they probably will) - this becomes the sustainable future. How did you benchmark the performance of the smaller model?
@omarsar0 I love your work with dair and funny enough I recently built an MCP to query ArXiv: https://t.co/smUibuZO54
Since you're working on Dair, wondering if you'd want to trade notes on research agents
@victor207755822 This is cool work! I'm also trying to build an auto-research agent - but I started with the retrieval first: https://t.co/smUibuZO54
How do you keep track of what the agent has/hasn't seen and how papers relate to each other over an extended time?
1/ ~23,000 new AI papers hit arXiv in the last 2 months. I had AI rank all of them, clustered the top 250, and 3 themes jumped out that will make up the hot startups of 2027.
2/
The Defense Trilemma https://t.co/L73TwGIuyB
The Two Boundaries https://t.co/WOwp9QwlZT
The Granularity Mismatch https://t.co/hYva2BRNqS
Intent-to-Execution Integrity https://t.co/mM2srfK9Uc
Aligning Provenance with Authorization https://t.co/Ij7cYtACAY