Bo Liu (Benjamin Liu)

@Benjamin_eecs

RL PhD | Undergrad @PKU1898 | Building autonomous decision making systems | Prev @deepseek_ai @AIatMeta FAIR | DeepSeek-V2/VL/Prover SPIRAL SPICE

Singapore

Joined February 2022

457 Following

815 Followers

192 Posts

Pinned Tweet

Bo Liu (Benjamin Liu)

@Benjamin_eecs

11 months ago

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies.
We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏
🔗 Paper: https://t.co/D7b9u4wqSg
🧑‍💻 Code: https://t.co/TiYyrU8UfH

277

186

72K

Benjamin_eecs retweeted

Ian Stewart-Binks

@binks_stewart

14 days ago

Project Genie is magical... but we've also been working on some new ways to interact with another player (or agent). It was super fun to demo this new capability at Google I/O this week, where we enabled attendees to explore worlds with Gemini as a companion. Going forward, we are incredibly excited to see how this can enable Gemini to learn how to interact with humans in embodied environments. Some examples of interacting with Gemini in real-time within these generated worlds:

32K

Benjamin_eecs retweeted

OpenAI

@OpenAI

16 days ago

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

27K

14M

Benjamin_eecs retweeted

Yu Su

@ysu_nlp

17 days ago

nice work by @DimitrisPapail and @VaishShrivas! this work is reinforcing a recent trend that tries to make foundation models jointly predict future states (aka 'world models') and actions instead of actions alone. we're seeing it in different forms, like World Action Models in embodied agents, or implicit world modeling in Early Experience (https://t.co/5uxJWO8b4m). also some interesting link to on-policy self-distillation. shared learning here is, there's still rich supervision signals that are underexplored. such signals were hard to exploit in classic ML, but foundation models have made it possible, potentially creating a recursive self-improvement loop.

205

194

26K

Benjamin_eecs retweeted

Richard Sutton

@RichardSSutton

18 days ago

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

136

975

571K

Benjamin_eecs retweeted

Tim Rocktäschel

@_rockt

23 days ago

Excited to co-found Recursive (@recursive_si) with an exceptional team in London and SF to create AI that experiments on how to safely improve itself, turning compute into knowledge that accumulates in an open-ended process of endless, automated scientific discoveries.

905

112

227

251K

Benjamin_eecs retweeted

John Schulman

@johnschulman2

25 days ago

Sharing our work on full-duplex multimodal models -- real-time interaction that's natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to intelligence/autonomy because they're harder to eval. In the future, we think every AI system will have something like an interaction model as the outer user-facing layer, continually keeping the user informed and learning what they actually want.

926

182

123K

Benjamin_eecs retweeted

Thinking Machines

@thinkymachines

25 days ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

462

16K

12K

Bo Liu (Benjamin Liu)

@Benjamin_eecs

about 1 month ago

@mickel_liu let's chat

134

Benjamin_eecs retweeted

Jason Weston

@jaseweston

about 1 month ago

💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create high-quality training & evaluation data. Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*. We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data. Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods. Overall, we believe this direction has the potential to change how we build AI data! Read more in the blog post: https://t.co/vjPvnTYfJx

615

103

684

42K

Benjamin_eecs retweeted

Ineffable Intelligence

@IneffableLabs

about 1 month ago

Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. https://t.co/zUuvPJGmcq

158

622

349K

Benjamin_eecs retweeted

Jason Weston

@jaseweston

about 1 month ago

DeepSeek-V4 uses our Hash routing approach developed back in 2021 -- see screenshot of their tech report! (Looks like a great model, congrats!) Bonus note: our same blogpost (& paper) back in 2021 also introduced 'looped transformers', but we called that staircase & ladder (see screenshot): https://t.co/widkeEXz56 https://t.co/PQLdPKg9PS

449

164

32K

Benjamin_eecs retweeted

Deli Chen

@victor207755822

about 1 month ago

DeepSeek-V3: Dec 26, 2024
DeepSeek-V4: Apr 24, 2026
484 days later, we humbly share our labor of love. As always, we stay true to long-termism and open source for all. AGI belongs to everyone. ❤️🌍 #DeepSeekV4 #AGIforEveryone #OpenSource

352

13K

Benjamin_eecs retweeted

Yu Su

@ysu_nlp

about 1 month ago

Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.

874

134

365

186K

Benjamin_eecs retweeted

Jason Weston

@jaseweston

2 months ago

🧮 Reasoning over Mathematical Objects 🧮
Our 70-page(!) paper is out on arXiv, as covered by several of our recent blog posts.
We study how to improve reasoning on hard tasks (e.g., math expressions) via:
• better training data (& new evals)
• better reward models (on-policy trained)
• better inference methods (on-policy trained)
📝: https://t.co/ChcQyMDWw1

205

159

15K

Benjamin_eecs retweeted

Jason Weston

@jaseweston

2 months ago

🔗Learning to Aggregate through Online RL🎯
ParaGator🔀🐊: strong parallel reasoning aggregation
Core claim: aggregation works best when training both stages together:
- LLM generator should produce diverse candidates
- LLM aggregator should synthesize into final answer
ParaGator trains candidate generation with pass@k, and aggregation with pass@1 on-policy, end-to-end.
Stops mode collapse/off-policy mismatch.
Improves math & scientific reasoning. 🚀🏆
Read more in the blog post: https://t.co/FVQ1KjoTLs

122

11K

Benjamin_eecs retweeted

Jason Weston

@jaseweston

2 months ago

🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧
RLLM = RL + LM-as-RM:
- post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks.
- trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those generative rewards to optimize the policy. 🔗📈
- uses the LLM’s reasoning + instruction-following for higher-quality rewards, boosting performance on all task types. 🚀🤖🏆
Read more in the blog post: https://t.co/50Of5rsanm

308

277

26K

Bo Liu (Benjamin Liu)

@Benjamin_eecs

2 months ago

@DavidJFan Congrats man :))

113

Benjamin_eecs retweeted

Seungone Kim

@seungonekim

3 months ago

🧮New work from @AIatMeta & @LTIatCMU! LM reasoning benchmarks mostly use simple answers like numbers (AIME) or multiple-choice options (GPQA). But for complex mathematical objects, performance drops sharply. We propose a set of solutions to solve this: https://t.co/DCZcnBhztq

10K

Benjamin_eecs retweeted

Jason Weston

@jaseweston

3 months ago

🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐
We release:
- PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ)
- Principia Collection: training data that improves reasoning across the board.
For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions.
We show that this ends up improving the overall reasoning ability of your model for all tasks.
Read more in the blog post: https://t.co/2VlT2PIxrX

127

13K

Benjamin_eecs retweeted

Xidong Feng

@Xidong_Feng

3 months ago

We've witnessed a crazy concurrent line of work on on-policy self-distillation in LLMs, and I truly believe this is the next paradigm of RL. Back in 2024, we proposed this exact conceptual shift in our paper, Natural Language Reinforcement Learning (NLRL).
The real breakthrough here isn't just the specific distillation mechanics. It’s that RL is fundamentally shifting away from the traditional "sample -> then filter or amplify" approach. Instead of passively waiting to stumble upon a good action to upweight, the field is moving toward true synthetic language data generation from experience, which enables true continual learning.
You can see this exact recipe playing out across all the recent hit papers:
• RLTF (2602.02482): Text critiques as privileged info
• OPSD (2601.18734): Ground-truth solutions
• SDPO (2601.20802): Runtime errors & execution feedback
• ERL (2602.13949): Self-reflections & demonstrations
Instead of just using a scalar reward to filter bad rollouts, they all use language feedback to explicitly generate a corrected, high-quality trajectory in hindsight, and then distill that competence back into the base policy.
While the specific ways we adapt RL to LLMs are still rapidly evolving, the core vision we outlined in NLRL holds true today: a single scalar is simply too poor of a carrier for credit assignment.
When people talk about "experiential memory" for agents today, they are essentially describing what we framed as a Language Value Function (LVF): not just RAG over past episodes, but storing the structured, strategy-level "why" behind what worked. And what we called "Language Policy Improvement" is exactly this feedback-aware self-distillation loop we see everywhere now.
Language, not scalars, is the future of RL.
📄 Check out our early exploration of this framework here: https://t.co/k94IQxs8eC

203

192

32K
