Excited to release ๐Polar๐, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones ๐ฅ -- Polar takes your harnesses directly as training environments without code change.
Find a problem, design the harness, and train your own agents! ๐งต
@YichuanM Good training and rollout infra are just half of the story. Task & env generation are the expensive part most labs wonโt share.
Single task+docker can cost you $1000+
Iโd recommend CUA Gym from @BowenWangNLP to see some synthetic scaling approaches
Besides token faithfulness (TITO), there are a few more challenges I noted in long form agent RL, tldr:
- Rollout takes 80%+ overall time. Long tail (eg. looping errors) rollout are ubiquitous, and so efficient async RL is a must.
- Correctly handling policy drifting during async RL. Strike a balance between efficiency and correctness (staleness).
- Scarcity of reward is a pain. Simple outcome testing can encourage suboptimal intermediate steps. So PRM style correction (by the right amount) is important.
- Environment cleanness and consistency are crucial. Reward hacking usually results from dirty env construction (eg. leaking files). Besides, mismatch between training environments and test-time harness harms more than you think.
- I wrote more about these in a recent blog here: https://t.co/rIKaXNZ7Zk
we solved most these problem with Polar and are patching up the rest. Stay tuned for upcoming updates!
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.
Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.
The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.
The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.
Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL ๐ค๐ฅ
https://t.co/zmx0EQl3jM
Excited to release ๐Polar๐, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones ๐ฅ -- Polar takes your harnesses directly as training environments without code change.
Find a problem, design the harness, and train your own agents! ๐งต
Hi Will, excuse us if we missed the feature in PrimeRL. Early this year we checked around the space but didn't find anything in OSS converting openai_chat / openai_responses / anthropic_messages / google_generatecontent into & from inference servers' OAI compatible format. So we decided to build our own converter. I think Harbor can only do same-type model switch today (correct me if I'm wrong) and we are trying to fill that gap (eg. training Qwen on ClaudeCode).
It's hard to keep track of the changing world now. So thanks for pointing it out. We'll edit the tech report over inaccurate statements.
RLVR has become the recipe for agentic post-training. But for Computer-Use Agents, the bottleneck is not the algorithm, it is the data. ๐
๐ We introduce CUA-Gym: a scalable, lightweight synthesis engine that turns arbitrary task queries into verifiable RLVR data for computer-use agents. The largest open CUA RLVR dataset to date:
๐ฏ 32,122 verifiable RLVR tasks with programmatic setup scripts + rewards
๐ 110 environments: 16 desktop apps + 94 synthesized mock web apps
๐ Qwen3.5-based CUA models trained with GSPO reach 72.6% on OSWorld-Verified and 56.6% on WebArena
๐ Paper: https://t.co/cdvHJPzgb1
๐ Homepage: https://t.co/kvhaOQxNx7
๐ค Dataset: https://t.co/w5vOIRdchR
๐ป Codebase: https://t.co/CcRlNTlS1c
๐งฉ Environments: https://t.co/fNZ6YAI8LD
๐งต[1/6]
Yes you are right about the unfamiliar apply_patch tool. Besides codex relies on bash only while other harnesses provide shortcuts like grep/glob/edit tools.
Pi > QwenCode after training remains a mystery to me as well atm. Might be that simpler system prompt helps RL performance.
@gneubig@OpenHandsDev Thanks Graham! The idea there is to have arbitrary harnesses interpreted without opening the box. We have a openhands-sdk shortcut tested already since it avoids nested runtime. Openhands should also work though the runtime binding is heavier โ would love to add and test soon.
(6/6) Next up: adding more evaluators (eg. PRM-style credit assignment), bridging to more trainiing frameworks, more tasks (eg. computer-use, self-evolving, etc.) and end-to-end optimizations (eg. global KV Caching, in-group speculative decoding, etc.).
We believe low-intrusion rollout-as-a-service that decouples environments and trainers is the right factoring for the next generation of agent RL. ๐ค