Thomas Kleine Buening @thomasklbg - Twitter Profile

Pinned Tweet

4 months ago

Deployed LLMs and users generate millions of conversations every day. These are full of useful learning signals, yet we don't use them for training. We introduce self-distillation for learning directly from user conversations – no rewards, no labels, no extra models.

https://t.co/he3Od43TFm

9

254

36

226

55K

thomasklbg retweeted

Trajectory

@trajectorylabs

9 days ago

5 Days of Trajectory

🏹 Day 5: Scaling SDPO to Agentic Tasks

Continual learning means you must train on data from production. But production gives you one example per task. A user makes a request once. You get one trajectory, not a batch.

However, current RL algorithms don't work that way. They need groups of tasks. By definition, that means you need some artificial environment to perform those rollouts in. But what if you don't?

SDPO is a promising route. It learns from a single trajectory, with no group required and failures still producing signal. The shape of the method matches the shape of production data.

But one fundamental problem remained. Every published SDPO work assumed fresh, on-policy rollouts. Agentic work cannot give you that. Trajectories run for an hour or more and arrive stale. On true agentic tasks, naive SDPO collapses.

We fixed it. We're the first to make SDPO work on agentic tasks.

On Mercor's APEX-Agents, with hour-long trajectories and near-zero base pass rates: 25% average reward, 5x over zero-shot. More importantly, it trains stably and the curve is still climbing.

Read more below.

9

129

11

101

39K

thomasklbg retweeted

Edward Hughes

@edwardfhughes

13 days ago

Proud to announce the launch of @inherent_labs. We’re reinventing the scientific research factory for the age of AI agents. I’m joined by co-founders @kallyaleksiev, @LouisKirschAI and @TantumSCollins; all are deeply technical operators. Time to live within the experiment.

8

164

18

37

19K

thomasklbg retweeted

Ronak Malde

@rronak_

15 days ago

We have been exploring new algorithmic frontiers and are excited to share our contributions to Self-Distillation Policy Optimization (SDPO) for agentic continual learning. Check out our blog post here: https://t.co/5xjL02jtUz

3

70

6

22

39K

thomasklbg retweeted

Jonas Hübotter

@jonashubotter

24 days ago

Self-distillation for long-horizon training at scale!

1

67

5

9

5K

thomasklbg retweeted

idan shenfeld

@IdanShenfeld

about 1 month ago

Self-distillation can reduce hallucinations when teaching LLMs new knowledge. I think the first time I heard about how RL enables learning without increased hallucination was in @johnschulman2's talk in 2023. Turns out, like many of RL’s benefits, this one also comes from learning on-policy.

2

117

17

103

12K

thomasklbg retweeted

Jonas Hübotter

@jonashubotter

about 2 months ago

Today and tomorrow we’ll be presenting self-distillation with orals at ICLR in Rio 🇧🇷

1. “Self-Distillation enables Continual Learning” at lifelong agents workshop (Sun 11:30am)
2. “Reinforcement Learning via Self-Distillation” at scaling post-training workshop (Mon 2:40pm)
3. “Test-Time Self-Distillation” at test-time updates workshop (Mon 4:15pm)

10

431

48

276

102K

thomasklbg retweeted

Barna Pásztor

@pasztorb

about 2 months ago

What do you do when reward models fail in RLHF? Scalar rewards flatten messy, context-dependent human preferences into a single number. The reward model learns a distortion, and the policy optimizes it faithfully. 🧵

1

21

2

7

2K

Thomas Kleine Buening

@thomasklbg

3 months ago

@ytz2024 The hindsight-guided OPD from OpenClaw-RL also seems very related @YinjieW2024 @LingYang_PU https://t.co/hfsCwAY2Yw

Yinjie Wang

@YinjieW2024

3 months ago

OpenClaw-RL Technical Report! Make your 🦞@openclaw stronger by just using it. We propose a method that combines the advantages of GRPO and OPD, and report evaluation results. The repo already has 1.7k stars, feel free to contribute! Come in and have fun~ @MengdiWang10 @LingYang_PU https://t.co/MKO8CyWbFI

36

713

128

690

65K

0

4

0

192

Thomas Kleine Buening

@thomasklbg

3 months ago

@ytz2024 It seems very related to on-policy self-distillation from user interactions. The main difference I see is that you have the extra knowledge extraction step, whereas in SDPO we learn directly from raw interactions https://t.co/Ut4ItEh6vA

Thomas Kleine Buening

@thomasklbg

4 months ago

Deployed LLMs and users generate millions of conversations every day. These are full of useful learning signals, yet we don't use them for training. We introduce self-distillation for learning directly from user conversations – no rewards, no labels, no extra models.

9

254

36

226

55K

2

6

0

1

357

thomasklbg retweeted

Ling Yang

@LingYang_PU

3 months ago

We've integrated on-policy distillation RL methods (SDFT and SDPO) into OpenClaw-RL's pipeline, working directly with the original authors! @IdanShenfeld @thomasklbg OpenClaw-OPD now supports even more effective learning paradigms for personalized AI agents trained from natural conversation feedback via @openclaw. We welcome the integration of novel and effective methods — if you have ideas, let's build together 🤝 🔗 https://t.co/ry18qekutm

1

131

18

102

8K

thomasklbg retweeted

Yinjie Wang

@YinjieW2024

3 months ago

Working with the authors @IdanShenfeld @thomasklbg of this excellent series of papers, we have integrated their novel on-policy distillation methods into OpenClaw-RL. We welcome integration of new and effective methods. Make your personal @openclaw🦞agents stronger every day.

1

16

5

10

2K

thomasklbg retweeted

Yinjie Wang

@YinjieW2024

3 months ago

Train your 🦞@openclaw simply by talking to it. Meet OpenClaw-RL. Host your model on our RL server, and your LLM gets optimized automatically. Use it anywhere. Keep it private. Make it more personal every day. We have fully open sourced everything. Come in and have fun!

43

744

87

981

99K

Thomas Kleine Buening

@thomasklbg

3 months ago

@YinjieW2024 @openclaw For your Training Method 2, would be super cool to use the slightly more direct approach of self-distillation from user interactions. Then you don’t need another model to provide hints (no additional generation and cheap training):

Thomas Kleine Buening

@thomasklbg

4 months ago

Deployed LLMs and users generate millions of conversations every day. These are full of useful learning signals, yet we don't use them for training. We introduce self-distillation for learning directly from user conversations – no rewards, no labels, no extra models.

9

254

36

226

55K

0

2

0

126

Thomas Kleine Buening

@thomasklbg

4 months ago

@jonashuebotter @pasztorb @IdanShenfeld @gio_ramponi @arkrause And continual learning: https://t.co/VrkY4kKgqQ

idan shenfeld

@IdanShenfeld

4 months ago

People keep saying 2026 will be the year of continual learning.

But there are still major technical challenges to making it a reality.

Today we take the next step towards that goal: a new on-policy learning algorithm, suitable for continual learning!

(1/n) https://t.co/tuDTBATlTQ

50

2K

223

1K

239K

0

13

3

8

1K

Thomas Kleine Buening

@thomasklbg

4 months ago

@jonashuebotter @pasztorb @IdanShenfeld @gio_ramponi @arkrause Also check out our other recent work on applying self-distillation to RL: https://t.co/iruBFprkYR

Jonas Hübotter

@jonashubotter

4 months ago

Training LLMs with verifiable rewards uses a 1-bit signal per generated response. This hides why the model failed.

Today, we introduce a simple algorithm that enables the model to learn from any rich feedback!
And then turns it into dense supervision.

(1/n) https://t.co/AR0yWgaKnL

22

1K

139

1K

211K

1

14

4

3

2K
