a skill that i am really proud of is my ability to iterate on experiments fast, and write "good" code.
writing code is an important skill to have as a researcher, and in this post i discuss some tips to hopefully help you get better at it!
the first 60 or so pages were so good... they could've done so much with the world.. it was so interesting.. only to throw it all away for 40 science theories being infodumped in the last 20 pages and for what
Finally finished The Three Body Problem. Very sorry to my friends who like it, but I found it unfathomably bad in practically every way. Worst science fiction I've read in a long time. Won't be reading the sequels. 1.5 out of 5.
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.
Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.
The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.
The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.
Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL ๐ค๐ฅ
https://t.co/zmx0EQl3jM
idk how people say they are doing research with agents...
every single thing models have come up with is the most useless idea ive heard.
still only good for implementation (that too requires so much babysitting to get correct)
(yes this is with highest reasoning, etc)
Great take on the importance of evals as upstream of everything, including training oftentimes. I'd go one step further and say that proper evaluation is becoming and exciting standalone discipline within AI: https://t.co/OPAEEL4bps
adding a very popular gem to my tech collection thanks to @Prolific! my dad will enjoy this a lot now that he spends his time toying with claude and codex :p
Another perk of attending conferences like ICLR :)
unbelievably good series on how traditional RL evolved into everything we see today!!!
my rule of thumb for good educational content is combining technical details with the context/history surrounding them; it gives a much richer understanding of why things are the way they are
Your RL post-training may be sabotaging your LLMโs test-time scaling!
Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*.
We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
๐ฅToday we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.
AI R&D automation will very likely unfold gradually, starting from โboringโ tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.
Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.
Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).
One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). Iโm sure future agents will do much better, but here is where we are now.
This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!