Excited to share that Agent-Controller representations for offline RL in presence of rich exogenous information is now accepted at #ICML2023 (https://t.co/XkLmbYUtDL)
This is a follow-up of our recent work on latent state discovery (#TMLR'23) https://t.co/qmz0gGgopI
Deeply saddened at the passing of my dear colleague, Dimitri Bertsekas.
Everyone in RL, OR and control theory already knows of his monumental contributions. Over the past seven years, we at @SCAI_ASU also got to know him as an unwaveringly kind and gracious man of science.
He truly enjoyed his research and remained active throughout, uploading two pre-prints to arXiv just in the past month!
While I fully expected him to continue working for years to come, I also know that his contributions and books will be cherished by generations.
RIP 🙏
We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens.
LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵
Our work on Next Latent Prediction, led by @jayden_teoh_, is a step towards this new pre-training paradigm.
Jayden did an amazing internship with MSR AI Frontiers!
https://t.co/ZqYwAa7GP5
Think of orchestration as search over thoughts. Then train the model to match that orchestration. This allows shifting the accuracy↔latency/context tradeoff by compressing search into weights.
Here are some research directions I enjoyed in #neurips (will compile some more soon!)
Bootstrapping long-horizon reasoning: Recent work [1, 2] shows we can train LLMs on short-step problems and use a curriculum to extend them to much longer chains. By composing simple problems into multi-step tasks and using outcome-only rewards, models learned to solve much harder problems. This suggests an efficient path to scale deep reasoning; would love to see this scale beyond verifiable domains.
Reward shaping and PRMs: To get better reasoning, we need to reward beyond basic task completion. Posterior-GRPO uses process-based rewards in code generation, outperforming ORM-based RL [3]; RL-Tango uses an LLM PRM that is co-trained with the generator to achieve SOTA on maths benchmarks [4]. ToolRL focuses on PRMs for tool usage [5].
RL on non-verifiable tasks: I saw a really nice transition from verifiable tasks (maths/code) to more open-ended objectives (dialogue, automation, etc). One interesting trend here is using offline RL for non-verifiable rewards and online RL for verifiable rewards [6]. Would have loved to see more work on online RL for non-verifiable rewards [7].
Science behind RL: There are a lot of interesting questions on what capabilities RL is eliciting in LLMs. [8] questions whether RL is adding any more reasoning capacity to the base model. [9] examines mechanisms to actively elicit meta-cognition to overcome these limitations. Would love to see more critical examination of the science behind RL.
[1] H1 by @sumeetrm, @philiptorr, @riashatislam, @sytelus, @casdewitt, @CharlieLondon02
[2] Reasoning Curriculum by @bo_pang0, @silviocinguetta, @CaimingXiong, @yingbozhou_ai
[3] Posterior-GRPO by @MouxiangC, @Zhongxin_Liu
[4] RL-Tango by @KaiwenZha, @ZhengqiGao, @maohaos2, @ZhangWeiHong9, @dina_katabi
[5] ToolRL by @emrecanacikgoz, @qiancheng1231, @dilekhakkanitur, @tur_gokhan, @hengjinlp
[6] Writing Zero (Not in NeurIPS) by @YunyiYang2
[7] JEPO by @robinphysics, @sidawxyz, @louvishh
[8] Does RL incentivize reasoning by @YangYue_THU, @RayLu_THU, @_AndrewZhao
[9] ReMA by @raywzy1, @MarkSchmidtUBC, @seawan, @linyi_yang
Microsoft Research NYC's AI Frontiers team (@JohnCLangford's group) is looking for Spring/Summer interns! Focus on self-supervised learning (jepa-style methods), latent world models, and rethinking VLMs. Interested in these topics? DM or email @Tea_Pearce, @riashatislam, or me!
Everyone’s talking about Kimi K2 Thinking and its impressive performance.
No full report yet, but judging from the Kimi K2/1.5 reports, it likely uses Policy Mirror Descent - an RL trick that’s quietly becoming standard in frontier labs.
Let’s break down what it is:
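For reference, here is the textbook KL-regularized form of a policy mirror descent step, written as a sketch. Whether and how Kimi K2 Thinking uses it is not confirmed here; the symbols (policy \pi_k at iteration k, reward r, temperature \tau, normalizer Z_k) are notation chosen for this sketch.

```latex
% One mirror descent step: maximize expected reward while staying
% close (in KL) to the current policy \pi_k.
\pi_{k+1} = \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \tau \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_k(\cdot \mid x) \big)

% Closed-form solution of the step above.
\pi_{k+1}(y \mid x) = \frac{\pi_k(y \mid x)\, \exp\big( r(x, y) / \tau \big)}{Z_k(x)},
\qquad
Z_k(x) = \sum_{y'} \pi_k(y' \mid x)\, \exp\big( r(x, y') / \tau \big)
```

A larger \tau keeps the new policy closer to \pi_k; a smaller \tau moves it more aggressively towards high-reward responses.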
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data?
Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over complex interdependent problems is incredibly important, but we currently lack enough long-horizon reasoning data
> Long-horizon problems are hard, which means training signal is sparse. We’d need a way to provide dense supervision
Our solution composes existing short-horizon data to form a synthetic curriculum that keeps growing in complexity! This allows us to scale RL on the same dataset while avoiding saturation, with curriculum acting as dense rewards.
At a small scale, we see massive in-domain long-horizon improvements, which transfer to significantly harder benchmarks. Training on composed 6th grade math problems leads to strong gains on AIME! 1/N🤿🧵
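A toy sketch of the composition idea, assuming each short-horizon problem can be modeled as a function of one integer input. The names AtomicProblem, compose_chain, outcome_reward and curriculum, plus the three example problems, are hypothetical illustrations and not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AtomicProblem:
    """A short-horizon problem: a text template with an {x} slot and its solver."""
    template: str
    solve: Callable[[int], int]


def compose_chain(problems: List[AtomicProblem], seed_value: int) -> Tuple[str, int]:
    """Chain problems so each step's input is the previous step's answer.

    Returns the composed prompt and the gold answer of the final step only.
    """
    lines = []
    value = seed_value
    for step, problem in enumerate(problems, start=1):
        # Only the first step sees a concrete number; later steps refer back.
        shown = str(seed_value) if step == 1 else f"the answer to step {step - 1}"
        lines.append(f"Step {step}: " + problem.template.format(x=shown))
        value = problem.solve(value)
    lines.append("Report only the answer to the final step.")
    return "\n".join(lines), value


def outcome_reward(model_answer: str, gold: int) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == gold else 0.0
    except ValueError:
        return 0.0


def curriculum(problems: List[AtomicProblem], seed_value: int,
               max_horizon: int) -> List[Tuple[str, int]]:
    """Build chains of horizon 1, 2, ..., max_horizon from the same problems."""
    return [compose_chain(problems[:h], seed_value) for h in range(1, max_horizon + 1)]


# Hypothetical example problems.
ADD3 = AtomicProblem("Add 3 to {x}.", lambda x: x + 3)
DOUBLE = AtomicProblem("Multiply {x} by 2.", lambda x: x * 2)
SQUARE = AtomicProblem("Square {x}.", lambda x: x * x)

if __name__ == "__main__":
    for prompt, gold in curriculum([ADD3, DOUBLE, SQUARE], seed_value=4, max_horizon=3):
        print(prompt, "\n gold:", gold, "\n")
```

Because intermediate answers are never checked, a model only earns reward by getting every step in the chain right, and the curriculum raises the horizon one step at a time.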
Excited to share this recent work on curriculum learning and long horizon reasoning!
We show that long horizon reasoning can be improved by scaling RL with existing data only.
Led by @sumeetrm, with folks from MSR AI Frontiers!
Would you like to double your gains from RL with verifiable rewards for the same dataset?
Our new paper led by Alesia Ivanova and @sumeetrm proposes a simple trick! 🧵
Work led by @sumeetrm along with an amazing team across Oxford, Princeton and Microsoft AI Frontiers.
Full thread on this coming soon. Thanks for sharing our work @iScienceLuvr!
h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
"In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multistep dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating."
A new Dion draft https://t.co/0AMnTzK5h7 with a more comprehensive study of use and variations. (Code https://t.co/jaAEr0474r )
A new Belief State Transformer draft https://t.co/FS1uSFjrTA with variations for tractability at somewhat larger scale. (Code https://t.co/GtA2gd356k)
I got interested in ML through attending tea time talks from Gatsby Unit at UCL (as an undergrad, sometimes, admittedly, not understanding anything from the talk - but somehow people there seemed quite humble and eager to teach)
Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress.
We present Group Filtered Policy Optimization (GFPO):🧵
1️⃣ Sample more per prompt
2️⃣ Rank by token efficiency (reward ÷ length)
3️⃣ Train on top-k
4️⃣ 🚀 Cut 80% of excess length with ≥ GRPO accuracy on AIME, GPQA, LiveCodeBench & more
For harder problems, Adaptive Difficulty GFPO learns to dynamically scale test-time compute—allowing more thinking only as needed. (1/12)
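A minimal sketch of steps 2️⃣ and 3️⃣ for a single prompt's group of sampled responses. The function name gfpo_advantages, the eps term, and the choice to normalize rewards over the retained subset while giving filtered-out responses zero advantage are assumptions for illustration, not the released implementation.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def gfpo_advantages(rewards: List[float], lengths: List[int], k: int,
                    eps: float = 1e-8) -> Tuple[List[int], List[float]]:
    """Keep the k most token-efficient responses and compute their advantages.

    rewards[i] and lengths[i] describe the i-th sampled response for one prompt.
    Returns (indices kept, per-response advantages). Responses that are
    filtered out receive an advantage of 0.0, so they add no gradient signal.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same length")
    if not 0 < k <= len(rewards):
        raise ValueError("k must be between 1 and the group size")

    # Token efficiency = reward / length (guard against zero-length responses).
    efficiency = [r / max(n, 1) for r, n in zip(rewards, lengths)]

    # Rank responses from most to least efficient and retain the top k.
    order = sorted(range(len(rewards)), key=lambda i: efficiency[i], reverse=True)
    kept = order[:k]

    # Normalize rewards using statistics of the retained subset only.
    kept_rewards = [rewards[i] for i in kept]
    mu = mean(kept_rewards)
    sigma = pstdev(kept_rewards)

    advantages = [0.0] * len(rewards)
    for i in kept:
        advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return kept, advantages


if __name__ == "__main__":
    kept, adv = gfpo_advantages(
        rewards=[1.0, 0.0, 0.0, 1.0, 0.0],
        lengths=[100, 400, 50, 200, 80],
        k=3,
    )
    print(kept, adv)
```

With binary rewards, incorrect responses always rank last on efficiency, so the filter mostly decides which correct responses to learn from: the short ones.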
When we released Phi-4-Reasoning in May, we noticed that the model was generating unnecessarily long traces. Since then, the team has been experimenting with different ways to mitigate this.
GFPO was particularly interesting because it is very simple, effective and can be used for other objectives beyond conciseness.
Thanks to the team for driving this forward, and we hope the community will find it useful.