Riashat Islam

@riashatislam

Research Scientist @ms_aifrontiers @MSFTResearch NYC; Ex @HUMAIN @DreamFoldAI PhD @Mila_Quebec, intern @MSRNYC @AppleMLR; RL, Reasoning and LLMs; WorldModels

New York, USA

Joined November 2016

1.3K Following

1.8K Followers

608 Posts

Pinned Tweet

Riashat Islam @riashatislam

about 3 years ago · Montréal

Excited to share that Agent-Controller representations for offline RL in presence of rich exogenous information is now accepted at #ICML2023 (https://t.co/XkLmbYUtDL) This is a follow-up of our recent work on latent state discovery (#TMLR'23) https://t.co/qmz0gGgopI

Riashat Islam @riashatislam

12 days ago

A huge loss for the RL, Control and Optimization communities. A legend has passed away.

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)

@rao2z

13 days ago · Tempe

Deeply saddened at the passing of my dear colleague, Dimitri Bertsekas. Everyone in RL, OR and control theory already knows of his monumental contributions. Over the past seven years, we at @SCAI_ASU also got to know him as an unwaveringly kind and gracious man of science. He truly enjoyed his research and has remained active all through; uploading two pre-prints to arXiv just in the past month! While I fully expected him to continue working for years to come, I also know that his contributions and books will be cherished by generations.. RIP 🙏

281

29K

587

Riashat Islam @riashatislam

2 months ago

Great work @sumeetrm!

Sumeet Motwani

@sumeetrm

2 months ago

We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens.

LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵 https://t.co/XZa90EokGO

406

272

141K

396

riashatislam retweeted

John Langford @JohnCLangford

2 months ago

LLMs can learn to manage their own context on some quite difficult thinking-required problems.

riashatislam retweeted

Dimitris Papailiopoulos

@DimitrisPapail

2 months ago

https://t.co/lbjcGDxpJn

144

475K

Riashat Islam @riashatislam

4 months ago

Our work on Next Latent Prediction, led by @jayden_teoh_, is a step towards this new pre-training paradigm. Jayden did an amazing internship with MSR AI Frontiers! https://t.co/ZqYwAa7GP5

Jim Fan

@DrJimFan

4 months ago

https://t.co/Npar79SvUh

150

414

662K

Riashat Islam @riashatislam

5 months ago

Please apply to join our team at MSR AI Frontiers

John Langford @JohnCLangford

5 months ago

We are hiring for a researcher in the foundations of generative AI: https://t.co/AqYzvMjtJc . Please share with whoever may be interested.

204

154

22K

208

riashatislam retweeted

Anirudh Goyal @anirudhg9119

6 months ago

Think of orchestration as search over thoughts. Then train the model to match that orchestration. This allows shifting the accuracy↔latency/context tradeoff by compressing search into weights. https://t.co/LFGd44WxQ7

256

183

23K

Riashat Islam @riashatislam

6 months ago

@DimitrisPapail Congratulations! You now look American ;)

149

riashatislam retweeted

Michael Elabd

@MichaelElabd

6 months ago

Here are some research directions I enjoyed in #neurips (will compile some more soon!)

Bootstrapping long‑horizon reasoning: Recent work [1, 2] shows we can train LLMs on short-step problems and curriculum them into much longer chains. By composing simple problems into multi-step tasks and using outcome-only rewards, models learned to solve much harder problems. This suggests an efficient path to scale deep reasoning; would love to see this scale outside of non-verifiable domains.

Reward shaping and PRMs: To get better reasoning, we need to reward beyond basic task completion. Posterior-GRPO uses process-based rewards in code generation, outperforming ORM-based RL [3]. RL-Tango uses an LLM PRM that is co-trained with the generator to achieve SOTA on maths benchmarks [4]. ToolRL focuses on PRMs for tool usage [5].

RL on non-verifiable tasks: I saw a really nice transition from verifiable tasks (maths/code) to more open-ended objectives (dialogue, automation, etc). One interesting trend here is using offline RL for non-verifiable rewards and online RL for verifiable rewards [6]. Would have loved to see more work on online RL for non-verifiable rewards [7].

Science behind RL: There are a lot of interesting questions on what capabilities RL is eliciting in LLMs. [8] questions whether RL is adding any more reasoning capacity to the base model. [9] examines mechanisms to actively elicit meta-cognition to overcome these limitations. Would love to see more critical examination of the science behind RL.

[1] H1 by @sumeetrm, @philiptorr, @riashatislam, @sytelus, @casdewitt, @CharlieLondon02
[2] Reasoning Curriculum by @bo_pang0, @silviocinguetta, @CaimingXiong, @yingbozhou_ai
[3] Posterior-GRPO by @MouxiangC, @Zhongxin_Liu
[4] RL-Tango by @KaiwenZha, @ZhengqiGao, @maohaos2, @ZhangWeiHong9, @dina_katabi
[5] ToolRL by @emrecanacikgoz, @qiancheng1231, @dilekhakkanitur, @tur_gokhan, @hengjinlp
[6] Writing Zero (Not in NeurIPS) by @YunyiYang2
[7] JEPO by @robinphysics, @sidawxyz, @louvishh
[8] Does RL incentivize reasoning by @YangYue_THU, @RayLu_THU, @_AndrewZhao
[9] ReMA by @raywzy1, @MarkSchmidtUBC, @seawan, @linyi_yang

269

527

51K

riashatislam retweeted

Manan Tomar

@manan_tomar

7 months ago

Microsoft Research NYC's AI Frontiers team (@JohnCLangford's group) is looking for Spring/Summer interns! Focus on self-supervised learning (jepa-style methods), latent world models, and rethinking VLMs. Interested in these topics? DM or email @Tea_Pearce, @riashatislam, or me!

213

225

19K

Riashat Islam @riashatislam

7 months ago

Policy mirror descent making an impact now! One of the early works led by @manan_tomar

idan shenfeld

@IdanShenfeld

7 months ago

Everyone’s talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from Kimi K2/1.5 reports, it likely uses Policy Mirror Descent - an RL trick that’s quietly becoming standard in frontier labs. Let’s break down what it is: https://t.co/OEHAlhQWHt

469

480

59K

riashatislam retweeted

Sumeet Motwani

@sumeetrm

8 months ago

🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data?

Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡

> RL on existing datasets saturates very quickly
> Reasoning over complex interdependent problems is incredibly important, but we currently lack enough long-horizon reasoning data
> Long-horizon problems are hard, which means training signal is sparse. We’d need a way to provide dense supervision

Our solution composes existing short-horizon data to form a synthetic curriculum that keeps growing in complexity! This allows us to scale RL on the same dataset while avoiding saturation, with curriculum acting as dense rewards.

At a small scale, we see massive in-domain long-horizon improvements, which transfer to significantly harder benchmarks. Training on composed 6th grade math problems leads to strong gains on AIME! 1/N🤿🧵

294

215

78K

Riashat Islam @riashatislam

8 months ago

Excited to share this recent work on curriculum learning and long-horizon reasoning! We show that long-horizon reasoning can be improved by scaling RL with existing data only. Led by @sumeetrm, with folks from MSR AI Frontiers!

Shital Shah

@sytelus

8 months ago

Would you like to double your gains from RL with verifiable rewards for the same dataset?

Our new paper led by Alesia Ivanova and @sumeetrm proposes a simple trick! 🧵 https://t.co/di2HVhmduw

386

Riashat Islam @riashatislam

8 months ago

Work led by @sumeetrm along with an amazing team across Oxford, Princeton and Microsoft AI Frontiers. Full thread on this coming soon. Thanks for sharing our work @iScienceLuvr!

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

8 months ago

h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

"In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multistep dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating."

163

121

13K

riashatislam retweeted

John Langford @JohnCLangford

9 months ago

A new Dion draft https://t.co/0AMnTzK5h7 with a more comprehensive study of use and variations. (Code https://t.co/jaAEr0474r ) A new Belief State Transformer draft https://t.co/FS1uSFjrTA with variations for tractability at somewhat larger scale. (Code https://t.co/GtA2gd356k)

Riashat Islam @riashatislam

10 months ago

I got interested in ML through attending tea time talks from Gatsby Unit at UCL (as an undergrad, sometimes, admittedly, not understanding anything from the talk - but somehow people there seemed quite humble and eager to teach)

Richard Sutton

@RichardSSutton

10 months ago

The tradition of tea time talks started long, long ago, and came to Alberta from the Gatsby unit (Neuroscience) at University College London.

143

42K

Riashat Islam @riashatislam

10 months ago

@steph_milani @nyuniversity Congrats Stephanie! Would be great to have you come over to MSR NYC for collaborations with the group here.

246

riashatislam retweeted

Vaish Shrivastava

@VaishShrivas

10 months ago

Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress.
We present Group Filtered Policy Optimization (GFPO):🧵
1️⃣ Sample more per prompt
2️⃣ Rank by token efficiency (reward ÷ length)
3️⃣ Train on top-k
4️⃣ 🚀 Cut 80% of excess length with ≥ GRPO accuracy on AIME, GPQA, LiveCodeBench & more
For harder problems, Adaptive Difficulty GFPO learns to dynamically scale test-time compute—allowing more thinking only as needed. (1/12)

353

303

72K
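
The three steps listed in the GFPO post above (sample more, rank by reward ÷ length, train on top-k) can be sketched roughly as follows. This is only an illustration of the filtering idea as the post describes it, not the authors' implementation: the `Response` class, the function names, and the GRPO-style advantage normalization over the kept responses are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    """One sampled completion for a prompt (hypothetical container)."""
    text: str
    reward: float  # e.g. 1.0 if the final answer is correct, else 0.0
    length: int    # number of generated tokens


def token_efficiency(r: Response) -> float:
    # Step 2 of the post: reward divided by length.
    return r.reward / max(r.length, 1)


def filter_top_k(group: List[Response], k: int) -> List[Response]:
    # Step 3 of the post: keep only the k most token-efficient samples.
    ranked = sorted(group, key=token_efficiency, reverse=True)
    return ranked[:k]


def group_advantages(kept: List[Response]) -> List[float]:
    # Assumption: normalize rewards within the kept subset, GRPO-style.
    if not kept:
        return []
    mean = sum(r.reward for r in kept) / len(kept)
    var = sum((r.reward - mean) ** 2 for r in kept) / len(kept)
    std = var ** 0.5
    return [(r.reward - mean) / (std + 1e-8) for r in kept]


if __name__ == "__main__":
    # Step 1 of the post: a (toy) larger group sampled for one prompt.
    group = [
        Response("a", reward=1.0, length=100),
        Response("b", reward=1.0, length=400),
        Response("c", reward=0.0, length=50),
        Response("d", reward=1.0, length=200),
    ]
    kept = filter_top_k(group, k=2)
    print([r.text for r in kept], group_advantages(kept))
```

In this toy group the two shortest correct responses are kept, so the long correct response and the incorrect one contribute no training signal; how rejected samples are actually treated in the paper is not stated in the post.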

riashatislam retweeted

Ahmed Awadallah @AhmedHAwadallah

10 months ago

When we released Phi-4-Reasoning in May, we noticed that the model was generating unnecessarily long traces. Since then, the team has been experimenting with different ways to mitigate this. GFPO was particularly interesting because it is very simple, effective and can be used for other objectives beyond conciseness. Thanks to the team for driving this forward, and we hope the community will find it useful.

Riashat Islam @riashatislam

10 months ago

@jpineau1 @cohere This is amazing news! Congrats Joelle

104
