@fchollet This is the underrated part of agentic coding.
It punishes vague engineering culture. If the setup, contracts, and docs only live in someone's head, the agent just makes that debt visible faster.
@trashpandaemoji I want this eval too. LSP support feels obviously useful, but coding agents are full of “obviously useful” things that only matter once you measure the whole loop.
@sdianahu Yes. Output-only evals miss the interesting failure.
For agents, the path matters because the product risk is usually in the tool call, stale context, or bad recovery step, not just the final answer.
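A minimal sketch of grading the path rather than only the output, assuming each agent run is logged as a list of steps. The Step fields, the tool names, and grade_run are hypothetical illustrations, not any framework's API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str  # hypothetical tool name, e.g. "read_file"
    ok: bool   # whether the tool call succeeded


def grade_run(steps, final_ok, allowed_tools):
    """Return path-level problems, even when the final answer is right."""
    problems = []
    for i, step in enumerate(steps):
        # Flag tool calls outside the set the task is supposed to need.
        if step.tool not in allowed_tools:
            problems.append(f"step {i}: unexpected tool {step.tool}")
        # A failed call counts as recovered only if the next step succeeds.
        if not step.ok:
            nxt = steps[i + 1] if i + 1 < len(steps) else None
            if nxt is None or not nxt.ok:
                problems.append(f"step {i}: failure without recovery")
    if not final_ok:
        problems.append("final answer wrong")
    return problems


# Example run: the final answer is fine, but the path used a tool it should not have.
ALLOWED = {"read_file", "edit_file", "run_tests"}
run = [Step("read_file", True), Step("run_tests", False), Step("delete_repo", True)]
problems = grade_run(run, final_ok=True, allowed_tools=ALLOWED)
```

An output-only check would pass this run. The path check still reports the unexpected tool call.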
Open models getting faster is not only a benchmark story.
It changes what you can run every day.
Cheap enough to repeat. Local enough to trust. Boring enough to become infrastructure.
@MaatWorkX This is exactly where agent work starts to look like real engineering.
Write the cases first, then every prompt or tool change has something to push against.
Otherwise it is just vibes with a better UI.
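Roughly what "cases first" can look like, as a sketch under assumptions: the agent is any callable from a task string to an output string, and a case passes if the output contains a required substring. CASES, run_cases, pass_rate, and stub_agent are made-up names for illustration.

```python
# Fixed cases written before touching any prompt or tool.
CASES = [
    {"task": "rename function foo to bar", "must_contain": "def bar("},
    {"task": "add a docstring to bar", "must_contain": '"""'},
]


def run_cases(agent, cases):
    """Run every case through the agent and record pass/fail per task."""
    return {c["task"]: c["must_contain"] in agent(c["task"]) for c in cases}


def pass_rate(results):
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def stub_agent(task):
    # Stand-in for a real agent call; always returns the same code.
    return "def bar():\n    pass"


results = run_cases(stub_agent, CASES)
rate = pass_rate(results)
```

Rerun the same cases after each prompt or tool change and compare the pass rate. A substring check is crude, but it gives each change something concrete to be measured against.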
The next jump for coding agents is not just better code generation.
It is better taste around when to stop, ask, roll back, or leave a clean trail for the human.
That layer still feels early.
@wlu314 This is one of the more underrated agent problems. Passing tests is not the same as checking the interface a human will actually use.
Visual QA loops are going to matter a lot for coding agents because UI regressions are usually where the demo breaks.