Binfeng Xu @billxbf - Twitter Profile

Pinned Tweet

8 days ago

Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones 🔥 -- Polar takes your harnesses directly as training environments without code change. Find a problem, design the harness, and train your own agents! 🧵

25

896

144

943

128K

Binfeng Xu

@billxbf

about 8 hours ago

@YichuanM Good training and rollout infra are just half of the story. Task & env generation are the expensive part most labs won’t share. A single task+docker can cost you $1000+. I’d recommend CUA Gym from @BowenWangNLP to see some synthetic scaling approaches.

1

2

0

1

357

Binfeng Xu

@billxbf

about 9 hours ago

how about /sleep-and-learn for internalizing and updating the weights?

Thariq

@trq212

1 day ago

https://t.co/R6exTuF7P8

194

8K

990

19K

2M

0

5

0

1

888

Binfeng Xu

@billxbf

1 day ago

@natolambert @allen_ai looking forward to seeing what’s next! 🐐🐐

0

1

0

223

billxbf retweeted

NVIDIA AI

@NVIDIAAI

3 days ago

Nemotron 3 Ultra is coming this week. ⌛️

106

3K

356

465

381K

Binfeng Xu

@billxbf

3 days ago

@willccbb @benglickenhaus I’m moving (back) to cursor for composer 2.5

0

2

0

135

Binfeng Xu

@billxbf

4 days ago

@JoshPurtell You can use any PO as long as scoring is unbiased (“verified”).

0

3

0

1

170

Binfeng Xu

@billxbf

5 days ago

Besides token faithfulness (TITO), there are a few more challenges I noted in long-form agent RL. TL;DR:
- Rollout takes 80%+ of overall time. Long-tail rollouts (e.g. looping errors) are ubiquitous, so efficient async RL is a must.
- Correctly handling policy drift during async RL. Strike a balance between efficiency and correctness (staleness).
- Scarcity of reward is a pain. Simple outcome testing can encourage suboptimal intermediate steps, so PRM-style correction (by the right amount) is important.
- Environment cleanness and consistency are crucial. Reward hacking usually results from dirty env construction (e.g. leaking files). Besides, a mismatch between training environments and the test-time harness harms more than you think.
I wrote more about these in a recent blog here: https://t.co/rIKaXNZ7Zk We solved most of these problems with Polar and are patching up the rest. Stay tuned for upcoming updates!
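To make the staleness point concrete, here is a minimal sketch of one common way to bound policy drift in async RL: tag every rollout with the policy version that generated it, and only train on samples within a maximum version lag. This is an illustration under stated assumptions, not how Polar handles it; the names `Sample`, `split_by_staleness`, and `max_lag` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int  # version of the weights that generated this rollout
    reward: float


def split_by_staleness(samples, current_version, max_lag):
    """Partition rollouts into fresh and stale by version lag (hypothetical helper)."""
    fresh, stale = [], []
    for sample in samples:
        lag = current_version - sample.policy_version
        # Samples within the allowed lag are trained on; the rest are
        # dropped or handed to an off-policy correction step.
        (fresh if lag <= max_lag else stale).append(sample)
    return fresh, stale


batch = [Sample(10, 1.0), Sample(8, 0.0), Sample(5, 1.0)]
fresh, stale = split_by_staleness(batch, current_version=10, max_lag=2)
```

A larger `max_lag` keeps more long-tail rollouts usable (efficiency) at the cost of training on more off-policy data (correctness), which is the trade-off described above.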

clem 🤗

@ClementDelangue

6 days ago

Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error. The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal. The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right. Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥 https://t.co/zmx0EQl3jM
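The round-trip problem described here can be reproduced with a toy tokenizer. The sketch below is an illustration only: the vocabulary, the greedy `encode`, and the `Rollout` buffer are all hypothetical and not taken from the linked deep-dive or any real library. It shows a sampled token sequence that decodes to the same text but re-encodes to different ids, and a buffer that appends ids without ever re-rendering what was sampled.

```python
from dataclasses import dataclass, field

# Toy vocabulary (assumption): "ab" can be one token or two.
VOCAB = {"a": 0, "b": 1, "ab": 2, "<tool>": 3, "ok": 4}
ID_TO_TEXT = {i: t for t, i in VOCAB.items()}
PIECES = sorted(VOCAB, key=len, reverse=True)


def encode(text):
    """Greedy longest-match tokenizer over the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        for piece in PIECES:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


def decode(ids):
    return "".join(ID_TO_TEXT[i] for i in ids)


@dataclass
class Rollout:
    """Token-in, token-out buffer: sampled ids are stored, never re-encoded."""
    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)  # 1 = sampled by the model

    def add_sampled(self, ids):
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))

    def add_env_text(self, text):
        # Only the new environment text is tokenized, then appended.
        ids = encode(text)
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))


sampled = [0, 1]                       # the model sampled "a" then "b"
retokenized = encode(decode(sampled))  # the round trip merges them into "ab"

rollout = Rollout()
rollout.add_sampled(sampled)
rollout.add_env_text("<tool>ok")
```

Here `retokenized` differs from `sampled` even though both decode to the same string, so a gradient computed on the re-encoded ids would land on a sequence the model never produced. The buffer avoids that by keeping the sampled ids untouched and masking environment tokens out of the loss.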

49

1K

140

1K

1M

3

136

15

134

18K

Binfeng Xu

@billxbf

6 days ago

@DavidSHolz @skypilot_org deserves more attention https://t.co/U7mhaU8ArV

0

6

1

4

443

Binfeng Xu

@billxbf

6 days ago

@swaapppyyy should be open once oss board finishes their review.

1

0

75

Binfeng Xu

@billxbf

6 days ago

@MarkoVelich @HaoZhang3438830 @ShaokunZhang1 @songyang_han @shizhediao @yunhengjackiez @NVIDIAAI Yes textual feedback SDPO is an important example we are adding.

0

2

0

1

64

Binfeng Xu

@billxbf

7 days ago

Hi Will, excuse us if we missed the feature in PrimeRL. Early this year we checked around the space but didn't find anything in OSS converting openai_chat / openai_responses / anthropic_messages / google_generatecontent to & from inference servers' OAI-compatible format. So we decided to build our own converter. I think Harbor can only do same-type model switches today (correct me if I'm wrong) and we are trying to fill that gap (e.g. training Qwen on ClaudeCode). It's hard to keep track of the changing world now, so thanks for pointing it out. We'll correct the inaccurate statements in the tech report.

2

5

0

2

548

Binfeng Xu

@billxbf

7 days ago

@ChaskinSaroff @HaoZhang3438830 @ShaokunZhang1 @songyang_han @shizhediao @yunhengjackiez @NVIDIAAI We have many such experiments to do on the list. Just want to release the tool first since the community obviously needs one now. 🙂

0

1

0

77

billxbf retweeted

Bowen Wang

@BowenWangNLP

8 days ago

RLVR has become the recipe for agentic post-training. But for Computer-Use Agents, the bottleneck is not the algorithm, it is the data. 🐌 🚀 We introduce CUA-Gym: a scalable, lightweight synthesis engine that turns arbitrary task queries into verifiable RLVR data for computer-use agents. The largest open CUA RLVR dataset to date: 🎯 32,122 verifiable RLVR tasks with programmatic setup scripts + rewards 🌐 110 environments: 16 desktop apps + 94 synthesized mock web apps 🏆 Qwen3.5-based CUA models trained with GSPO reach 72.6% on OSWorld-Verified and 56.6% on WebArena 📄 Paper: https://t.co/cdvHJPzgb1 🏠 Homepage: https://t.co/kvhaOQxNx7 🤗 Dataset: https://t.co/w5vOIRdchR 💻 Codebase: https://t.co/CcRlNTlS1c 🧩 Environments: https://t.co/fNZ6YAI8LD 🧵[1/6]

18

503

94

562

96K

Binfeng Xu

@billxbf

8 days ago

Yes, you are right about the unfamiliar apply_patch tool. Besides, Codex relies on bash only, while other harnesses provide shortcuts like grep/glob/edit tools. Pi > QwenCode after training remains a mystery to me as well atm. It might be that a simpler system prompt helps RL performance.

1

2

0

376

Binfeng Xu

@billxbf

8 days ago

@gneubig @OpenHandsDev Thanks Graham! The idea there is to have arbitrary harnesses interpreted without opening the box. We have an openhands-sdk shortcut tested already since it avoids nested runtime. Openhands should also work, though the runtime binding is heavier; would love to add and test soon.

1

0

1K

Binfeng Xu

@billxbf

8 days ago

📜Paper: https://t.co/gYfCyIGLRJ 💻Codebase: https://t.co/vEXS9GfQKW Thanks to the following contributors! @HaoZhang3438830, @ShaokunZhang1, @songyang_han, Mingjie Liu, Jian Hu, @shizhediao, Zhenghui Jin, @yunhengjackiez. This work was done at @NVIDIAAI with great mentorship from @jankautz, @doyend, Michael Demoret. The work draws inspiration from open-source contributions @slime_framework @sgl_project @radixark and @nvidia Megatron, RL and Gym family 💚.

2

42

4

50

4K

Binfeng Xu

@billxbf

8 days ago

(6/6) Next up: adding more evaluators (e.g. PRM-style credit assignment), bridging to more training frameworks, more tasks (e.g. computer-use, self-evolving, etc.) and end-to-end optimizations (e.g. global KV caching, in-group speculative decoding, etc.). We believe low-intrusion rollout-as-a-service that decouples environments and trainers is the right factoring for the next generation of agent RL. 🤗

1

20

1

4

5K
