Pretraining has scaling laws to guide compute allocation. But for RL on LLMs, we lack a practical guide on how to spend compute wisely.
We show the optimal compute allocation in LLM RL scales predictably.
Key takeaways below
RLVR has become the recipe for agentic post-training. But for Computer-Use Agents, the bottleneck is not the algorithm, it is the data.
We introduce CUA-Gym: a scalable, lightweight synthesis engine that turns arbitrary task queries into verifiable RLVR data for computer-use agents. The largest open CUA RLVR dataset to date:
- 32,122 verifiable RLVR tasks with programmatic setup scripts + rewards
- 110 environments: 16 desktop apps + 94 synthesized mock web apps
- Qwen3.5-based CUA models trained with GSPO reach 72.6% on OSWorld-Verified and 56.6% on WebArena
Paper: https://t.co/cdvHJPzgb1
Homepage: https://t.co/kvhaOQxNx7
Dataset: https://t.co/w5vOIRdchR
Codebase: https://t.co/CcRlNTlS1c
Environments: https://t.co/fNZ6YAI8LD
🧵 [1/6]
Introducing Equilibrium Reasoners (EqR)!
Feedforward models and weight-tied models behave very differently on hard reasoning generalization.
EqR pushes this difference to the extreme by learning task-conditioned neural attractors.
• Sudoku-Extreme: 99.8%
• Maze: 93%
#ICML2026
Excited to announce our tutorial: "Future of Work in the Age of LLMs" at #ACL2026 in San Diego, July 2!
There's a lot of speculation about AI and the future of human work. Our tutorial unpacks it from four angles:
- The landscape of human work
- How to build LLMs to augment real-world workflows
- How to evaluate these LLMs
- The future of work with LLMs/LLM-based agents
Slow, heavy environments have been the real bottleneck for agentic RL. NanoRollout tackles it head-on with a clean rollout-as-a-service design, integrated with miles for scalable agent RL.
Great work from the team!
Thrilled to see those promising numbers! 🤯
Same finding on our end with NanoRollout: cross-scaffold generalization basically doesn't happen out of the box -- something the field should be talking about more.
Cool. I always enjoy playing with nano projects.
No matter who asks me how to learn LLMs, my answer is always the same.
- Start with nanochat/nanogpt.
- Then pick one super niche direction.
- Deep dive into it. Build a nano project. Scale it gradually.
That's all.
The lack of lightweight, open agent infra has been a massive pain point. This is a great starting point, especially for large-scale, battle-tested coding / computer-use agent training with thousands of parallel envs!
Nice work! Training digital agents isn't trivial; co-designing rollouts with targeted environments is the pain point once you dig into agentic RL. This is super clean, and agentic RL folks should try it out.
Happy to release NanoRollout, our infra attempt to scale digital agent rollouts without pain. Setting up and scaling parallel digital agent envs is one of the biggest headaches in agent training / deployment. The open community hasn't handled it well.
Two appealing features from NanoRollout:
- Non-intrusive RL integration with frameworks such as miles, verl, tunix; validated end-to-end, e.g. outperforms DeepSWE-32B at a large 4k batch size
- Unified support across agent harnesses and envs (SWE-Bench, Terminal-Bench, OSWorld, CocoaBench), with fast parallel eval that reproduces published scores (e.g., full SWE-Bench Verified eval from 102 min → 18 min, 5.7x faster ⚡)
And the core logic is just ~900 LOC.
Hope NanoRollout helps if you're also trying to scale agent rollouts. Check out the tech blog and github for more details!
Big thanks to the fantastic co-lead @JunliWang2021
Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments.
We introduce NanoRollout, a lightweight open infra (900 lines of core code) for digital agent rollout at scale, validated with three workloads:
- Large batch size (4K) SWE Agent RL -> surpasses DeepSWE-32B
- 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
- Fast evaluation on coding/cua/unified agent -> finish
Check out our blog: https://t.co/IBNqqbLqra
Today, we're excited to launch Recursive (@recursive_si): an exceptional team across London and San Francisco, building AI systems that can safely improve their own capabilities over time.
A great piece on self-distillation using failures! Besides scaling up the number of rollouts, actively extracting more signal from raw rollouts should be an important way to improve agents and save compute.
On-policy self-distillation is a promising direction for learning from rich textual feedback. But can it really learn from failed trajectories?
Our answer: not quite -- unless we let the model actively interpret them.
🧵 1/N