Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your own self-made harness 🔥 -- Polar takes your harnesses directly as training environments without code changes.
Find a problem, design the harness, and train your own agents! 🧵
Excited to introduce ProRL Agent: Rollout-as-a-Service for RL training of multi-turn LLM agents! 🚀
As we move toward complex agentic tasks, rollout infrastructure is often a bottleneck. We’re decoupling I/O-heavy rollouts from GPU training via a unified HTTP API.
Why ProRL Agent?
Decoupled & Scalable: Treats rollout as a service, allowing near-linear throughput scaling.
System-Level Optimization: Includes load balancing and automated sandbox cleanup for high stability.
Integrated: Now part of NVIDIA NeMo Gym to help researchers scale RL pipelines faster.
The Results 📈
On SWE-bench-Verified, we saw significant gains:
+8.4 on Qwen3-8B
+8.2 on Qwen3-14B
Proven success across STEM, Math, and General Coding agents.
Check out the research and open-source code:
📄 Paper: https://t.co/l4wR6SbJ7m
💻 Repo: https://t.co/5otcyzkDKe
Huge thanks to the team and NVIDIA for the support! 👏
We’re open-sourcing the data and model behind Golden Goose 🦢✨. Check them out and see how we turn unverifiable internet text 🌐 into large-scale RLVR tasks 😎.
📊 GooseReason-0.7M: https://t.co/xBu9KC5Q9F
🤖 GooseReason-4B-Instruct: https://t.co/iT2ViXGbqM
There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱
Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐
You asked, and we listened.
The @nvidia ProfBench leaderboard 🏆 is here on @huggingface : https://t.co/W9PE6rbzfq
One design choice we made for the leaderboard is to distinguish open-weight vs closed-source models and reasoning vs instruct models. Separately, we also show the cost of running the entire benchmark (thanks to @openrouter for putting prices in one place) because real-world users absolutely care about prices.
Putting this together with @viviennezhangx, we were surprised to find that open-weight models can sometimes perform at a similar level to closed-source models but at cents on the dollar. 🤑
Thanks @ClementDelangue @imohitmayank for the amazing suggestion!
Which models do you want to see on there next? Comment below and I’ll run them (nothing crazy though).
#ProfBench #LLM #AIevaluation #NeMo #NVIDIA #OpenSourceAI #AIresearch #AgenticAI #GenerativeAI #BuiltByExperts #GTCDC
We built ProfBench to raise the bar for LLMs - literally.
At @NVIDIA, we worked with domain experts to create a benchmark that goes far beyond trivia and short answers.
ProfBench tests LLMs on complex, multi-step tasks that demand the kind of reasoning, synthesis, and clarity you'd expect from a PhD physicist or MBA consultant.
🌎 This isn’t just a dataset drop. It’s a global collaboration: 38 professionals across 8 countries contributed over 7,000 expert-written rubrics across finance MBA 💵, consulting MBA 📊, chemistry PhD 🧪 and physics PhD 🚀.
🧗 Every prompt and grading rubric was handcrafted, requiring tens of hours of dedicated and focussed work.
Now fully supported in the NeMo Evaluator SDK, ProfBench enables reproducible, rubric-based evaluations and side-by-side model comparisons.
🔗 ProfBench on @HuggingFace https://t.co/wmOyvLY6e7
🔗 NeMo Evaluator SDK https://t.co/JgFJklQqPr
I’m so proud of the team that made this happen. Let’s keep pushing what AI can do.
Work done with @jaehunjung_com @GXiming @shizhediao Ellie Evans @jiaqizengggggg @PavloMolchanov @YejinChoinka @jankautz @doyend
#ProfBench #LLM #AIevaluation #NeMo #NVIDIA #OpenSourceAI #AIresearch #AgenticAI #GenerativeAI #BuiltByExperts #GTCDC
Does RL truly expand a model’s reasoning 🧠 capabilities? Contrary to recent claims, the answer is yes, if you push RL training long enough!
Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model 💥 and offering new insights into the debate.
Introducing Reasoning Gym: Over 100 procedurally generated reasoning environments for evaluation and RLVR of language models. Generate virtually infinite training or evaluation data with fine-grained difficulty control and automatic verifiers. 🧵 1/
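For readers new to the idea, here is a minimal sketch of what a "procedurally generated environment with difficulty control and an automatic verifier" can look like. It is only an illustration: `make_addition_task` and `verify` are hypothetical names made up for this example, not the Reasoning Gym API, and the addition task is an assumed toy problem.

```python
import random


def make_addition_task(seed: int, difficulty: int) -> dict:
    """Generate one toy arithmetic task (hypothetical example, not the Reasoning Gym API).

    The seed makes generation reproducible; difficulty controls both the
    number of terms and how large each term can be.
    """
    rng = random.Random(seed)
    n_terms = difficulty + 1
    terms = [rng.randint(0, 10 ** difficulty) for _ in range(n_terms)]
    question = " + ".join(str(t) for t in terms) + " = ?"
    return {"question": question, "answer": str(sum(terms))}


def verify(task: dict, response: str) -> float:
    """Automatic verifier: reward 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if response.strip() == task["answer"] else 0.0


# Every new seed yields a fresh task, so the supply of data is effectively unbounded.
tasks = [make_addition_task(seed, difficulty=2) for seed in range(3)]
```

Because the answer is computed at generation time, the reward can be checked without a human or a judge model, which is the property that makes such tasks usable for RLVR.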
I really like the muTransfer paper (https://t.co/4Wfxw4Duvd). To help me understand the paper better, I wrote a blog post deriving some of the equations that are missing from the paper. https://t.co/lwsaMOCmkI
Thank you @TheGregYang for the wonderful theoretical work!
I just watched this video and was super impressed by how well @ykilcher communicated the essence of our paper. If you want to understand why AlphaZero can't play poker and why ReBeL can, this is a great video to watch!
Learn how to achieve a 100x speedup using @numba_jit and @rapidsai for efficient and fast fractional differencing computation on #GPUs. https://t.co/YKYjp4RbED
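As background, here is a minimal CPU-only sketch of fixed-window fractional differencing in plain Python. It is not the Numba/RAPIDS implementation from the linked post; the function names and the fixed-window truncation are assumptions chosen for illustration. The weights follow the standard binomial recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.

```python
def frac_diff_weights(d: float, size: int) -> list:
    """Return the first `size` weights of the (1 - B)^d expansion."""
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series: list, d: float, window: int) -> list:
    """Fractionally difference `series` using a fixed window of past values.

    Output element i corresponds to input index i + window - 1, since the
    first window - 1 points lack enough history.
    """
    w = frac_diff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        out.append(sum(w[k] * series[t - k] for k in range(window)))
    return out


# With d = 1 and a window of 2 this reduces to ordinary first differences.
first_diff = frac_diff([1.0, 3.0, 6.0, 10.0], d=1.0, window=2)
```

Each output point is an independent weighted sum over a sliding window, which is why the computation parallelizes well on GPUs.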
Learn how you can achieve up to 20x speedup in your Quant workflow by leveraging #gQuant, a set of finance examples built on RAPIDS. https://t.co/6iIjaKlqhd
To help researchers and data scientists in #finance accelerate their workflows with @rapidsai, we've published a new technical post highlighting a few #gQuant finance examples demonstrating the value of GPU accelerated #datascience: https://t.co/InYES8WmJ6