For AI agents to scale beyond narrow tasks, they need to learn how the world works from their own interactions — in a self-supervised way, without relying on expert data or rewards.
We introduce Reinforcement World Model Learning (RWML), a self-supervised method that trains LLM-based agents to align their simulated next states with actual environment dynamics, improving robustness and adaptability across tasks.
Empirically, we find RWML directly boosts solve rate on ALFWorld and τ² Bench without expert data — and scales naturally to large RL settings.
[4/n] Takeaway: agents can learn more from their own experience than we often assume.
By aligning imagined next states with what actually happens — and doing so in a representation space that highlights meaningful changes — RWML turns every interaction into a learning signal and builds richer world understanding.
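To make the idea of scoring an imagined next state against the real one concrete, here is a minimal sketch. It is not the RWML implementation: the thread does not specify the representation or the reward, so the bag-of-words `embed` function, the cosine score, and the name `world_model_reward` are all illustrative assumptions standing in for whatever encoder and objective the method actually uses.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def world_model_reward(predicted_next_state: str, actual_next_state: str) -> float:
    """Hypothetical self-supervised signal: how closely the agent's imagined
    next state matches the observed one, compared in embedding space rather
    than as raw strings. No expert data or task reward is needed."""
    return cosine(embed(predicted_next_state), embed(actual_next_state))


# The environment's real response to an action, and two imagined outcomes.
actual = "You open the drawer. The drawer contains a key."
good_guess = "The drawer is open and contains a key."
bad_guess = "Nothing happens."

good_score = world_model_reward(good_guess, actual)
bad_score = world_model_reward(bad_guess, actual)
```

In this toy setup the prediction that captures the meaningful change (the drawer opening, the key appearing) scores higher than the one that misses it, which is the kind of signal an RL update could then reinforce.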
Why can (V)LM agents ace coding and math, yet struggle so badly in more complex environments like computer or phone use? 🤔
We find that one key factor lies in models' ability to understand and *simulate* the environment’s dynamics — and propose **Dyna-Mind** to address this!
🧵[1/n]
[4/n] Take-away: agents can learn not just from success, but from *every* environment interaction to enhance their reasoning/acting, if used correctly💡
Can we ever truly trust foundation models—and if so, how?
Our ICCV TrustFM workshop (https://t.co/lnhouGqc1L) is now accepting submissions (deadline: 8/1, attending: 10/19-10/23, Hawai'i)
Submit, attend, and learn from everyone around the world who is making FMs more trustworthy
Co-organized with Zeliang Zhang, @ChenliangXu, @huang_chao_cs, @MuCai7, @DrogoKhal4, @jd92wang, @hendrycks, @jure, etc.
@trustworthy_ml @StanfordAILab
🚨COLM 2025 Workshop on AI Agents: Capabilities and Safety @COLM_conf
This workshop explores AI agents’ capabilities—including reasoning and planning, interaction and embodiment, and real-world applications—as well as critical safety challenges related to reliability, ethics, and human-agent interaction.
Non-archival submission deadline: June 23, 2025!
🌐Website: https://t.co/Zyghrm7wDW
📜OpenReview: https://t.co/OkApJbtBxN
👥Program committee/Reviewer sign-up form: https://t.co/2onctOrhsA
✉️Mailing list: [email protected]
📅WikiCFP: https://t.co/y0sBiJFpdb
Shout out to our amazing organizing team members @Zhou_Yu_AI @ysu_nlp @uiuc_aisecure @ Yi Zhang @ Baolin Peng @ Jim Zhiwei Liu @DanRothNLP @LyleUngar @sharathguntuku @yugu_nlp @xy2437 @jeffrey_ch0 @AnnieFeng6 @Haoyu_Wang_97 @ZRChen_AISafety @raphaelshu @yooli23
Thanks @MSFTResearch for all the support! Excited to share that our work is also accepted by #ICLR2025!
- Paper: https://t.co/SBWBOxoivW
- Code: https://t.co/agV471v8U5
- Website: https://t.co/5iiUx0kTTz
ExACT combines Reflective-MCTS and Exploratory Learning to improve AI agents' decision-making, enabling test-time compute scaling. Learn how these methods help agents refine strategies for state-of-the-art performance and improved computational efficiency: https://t.co/GUhzuX9NuQ
To effectively solve modern computer tasks, AI agents need to be able to strategically explore the environment and efficiently learn from past interactions.
We present R-MCTS and Exploratory Learning for building o1-like models for agentic applications. Our GPT-4o powered R-MCTS agent achieves SOTA performance on VisualWebArena. Notably, R-MCTS and Exploratory Learning (without MCTS) demonstrate compute scaling properties at both training and test time!
🌐: https://t.co/5UIDKzi3Ie
Part 2.2) In our experiments with VisualWebArena, we find that:
- Exploratory Learning boosts performance: GPT-4o fine-tuned with Exploratory Learning performs better even without search, an effect similar to scaling test-time compute with MCTS.
- Exploratory Learning enables test-time compute scaling: The fine-tuned GPT-4o exhibits better performance when allowed more actions per task, enhancing decision-making and task completion.
- Exploratory Learning improves generalization to unseen tasks: The fine-tuned GPT-4o performs better on unseen tasks than both the untrained model and a model trained with imitation learning on best actions.