May takeaway on Coding Agents and AI-for-AI:
the frontier is moving from "which base model is best?" to "who has the best training data, verifier, harness, and eval loop?"
5/ Harness became a core asset. LiteCoder-Terminal, unix-ctf, and agent failure taxonomies all point to the same bottleneck: better coding agents need better executable environments, not just better prompts.
4/ The data unit is changing. For agent SFT/RL, a "successful patch" is too coarse. The useful unit is the state transition: what the agent saw, tried, verified, failed, recovered from, and changed.
3/ RL post-training matured fast. May moved from rollout efficiency and cold-start tricks, to GRPO/credit assignment, to RL data selection: SAERL, label-free RL, verifier-free rewards, and offline RL for code.
2/ Eval is fragmenting. SpecBench, AgentLens-style trace diagnostics, FastKernels, TerminalBench-related work, DeepSWE, and SWE-bench Multimodal all point to one thing: final pass/fail is not enough.
1/ SWE-bench looked stable at the top in May. Public Verified scores hovered around 79%, while new work pushed toward harder evals: multimodal SWE, terminal tasks, production kernels, long-horizon agents, and safety.
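To make point 4/ concrete, here is a minimal sketch of what a state-transition record could look like, assuming a trajectory is stored as an ordered list of steps. The `Transition` fields and the `recoveries` helper are hypothetical names chosen for illustration; they are not taken from any of the papers or benchmarks mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transition:
    """One step of an agent trajectory (hypothetical schema, for illustration)."""
    observation: str                # what the agent saw
    action: str                     # what it tried
    check: Optional[str] = None     # verification command it ran, if any
    passed: Optional[bool] = None   # result of the check; None if nothing was verified
    diff: str = ""                  # what changed in the workspace


def recoveries(trajectory: List[Transition]) -> List[int]:
    """Return indices of steps whose check passed when the previous verified step failed."""
    found = []
    last_failed = False
    for i, step in enumerate(trajectory):
        if step.passed is None:
            # Unverified steps neither count as a recovery nor reset the failure state.
            continue
        if step.passed and last_failed:
            found.append(i)
        last_failed = not step.passed
    return found
```

A record like this keeps the fail-then-recover steps that a single "successful patch" label would throw away, so they can be filtered or weighted separately when building SFT/RL data.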
"Long context is a solved problem." It isn't.
Context rot. Context anxiety. Lossy compression that compounds over hours.
I wrote about why the answer is to continue scaling: to push past 1M tokens.
https://t.co/8BR599To0C
Introducing MLS-Bench for machine learning science.
Auto research built on coding agents is undoubtedly another major market beyond SWE coding, and a harder one. However, we believe there are two different categories here. Auto research from @karpathy, MLE-Bench, and PostTrainBench are one type of attempt: engineering. Agents are asked to optimize a specific engineering objective, but we do not require them to produce transferable, generalizable behavior. MLS-Bench targets the other type: science.
MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve a specific component of an ML system or algorithm, and to demonstrate that the improvement generalizes and scales under controlled settings. We find that current agents are still far from consistently outperforming human-designed methods, and that engineering-style tuning is much easier for them than genuine method invention.
@Lyubh22 is the lead of this project. I was deeply impressed by the way he used Discord to organize agent trajectories and share them with the team.
Paper: https://t.co/pr2DgBcQto
Code: https://t.co/mvoirci6o6
Website: https://t.co/jmJMwfgChL
Intelligence will not become a commodity.
Intelligence will not be too cheap to meter.
In the future, the real gap between people will be defined by their access to intelligence.
Just as a car lets an average person outrun any world champion, AI lets ordinary minds solve problems the greatest minds of the past could not.
Be prepared for this revolution.