ya i like this, i used to think DeepSeek R1 was using way bigger group sizes and batch sizes than it actually was
Also this speaks to the beauty of RL
I can just say fuck it and double my group size and get reliably better performance for the extra compute spent. Can't really do the same in pretraining
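A minimal sketch of why group size is such an easy knob, assuming GRPO-style group-normalized advantages (the setup DeepSeek R1 reports using). `sample_group` is a hypothetical stand-in for real rollouts, and the solve rate is made up for illustration. It only shows the mechanics, not the performance claim:

```python
import random
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def sample_group(group_size, solve_rate, rng):
    """Hypothetical stand-in for rollouts: reward 1.0 on success, else 0.0."""
    return [1.0 if rng.random() < solve_rate else 0.0 for _ in range(group_size)]

rng = random.Random(0)
for group_size in (8, 16):  # doubling the group doubles rollouts per prompt
    rewards = sample_group(group_size, solve_rate=0.25, rng=rng)
    adv = group_advantages(rewards)
    print(group_size, sum(rewards), [round(a, 2) for a in adv])
```

A group where every rollout gets the same reward produces all-zero advantages, so it contributes no gradient signal. A bigger group makes that outcome less likely on hard prompts.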
@SeunghyunSEO7 nah for RL this is good
Under-talked abt aspect where u could genuinely go to 2B token batch sizes or more and it's still better
Pretraining batch size tbh was pretty big esp with AdamW, some interesting AdamW hparams too, although maybe those are related to the large batch sizes?
In general RL seems to have a very different batch size scaling than pretraining
Also even in their previous shorter context stages, the batch sizes were always bigger or equal to the pretraining batch size
KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀
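A toy sketch of what cross-node prefix cache reuse means, assuming block-level chained hashing of token prefixes. This is NOT the Mooncake Store, prime-rl, or vLLM API: `SharedPrefixStore`, `BLOCK_SIZE`, and the placeholder values are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value)

def block_hashes(tokens):
    """Chain-hash full blocks so each hash identifies the whole prefix so far."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = ",".join(map(str, tokens[i:i + BLOCK_SIZE])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes

class SharedPrefixStore:
    """Hypothetical stand-in for a KV store that several inference nodes can reach."""
    def __init__(self):
        self._blocks = {}

    def put(self, tokens):
        for h in block_hashes(tokens):
            self._blocks.setdefault(h, "kv-block-placeholder")

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV blocks are already stored."""
        n = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break
            n += BLOCK_SIZE
        return n

store = SharedPrefixStore()
turn_1 = list(range(10))            # system prompt + first tool call
turn_2 = turn_1 + [100, 101, 102]   # same history plus a new tool result
store.put(turn_1)                   # "node A" finishes turn 1
print(store.cached_prefix_len(turn_2))  # "node B" checks how much prefill it can skip
```

Agentic rollouts keep re-sending the same growing history, so most of every turn is a prefix that was already computed somewhere. If the store is shared, whichever node picks up the next turn only has to prefill the new tail.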
@ColbyBanbury @Chert_Fu @lateinteraction Ya the organisational thing does steer in favor of expert RL then OPD.
Although I do imagine that after setting up an initial mixed domain RL run, it’s easier to then mix stuff into the run from each team separately
@willdepue @georgejrjrjr Also literally can’t say this stuff is trivial when, from my understanding, Claude Code, Codex, Cursor… didn’t support PTC or dynamic context discovery (for stuff other than the codebase) until after RLMs (or still don’t support it)
Tbf a significant part of RLM is programmatic tool calling, which Anthropic blogged about months after the initial RLM blog, with pretty big gains (I think they use it for certain evals too)
Also things like tool search, which many adopted and wrote blogs about the importance of, are a part of the RLM idea (iirc these were after the initial RLM blog in Oct 2025)
Also most of the Cursor dynamic context discovery blog, which came out around 2 months after the RLM blog, is ideas an RLM would already have
https://t.co/MflVrGGPlx
https://t.co/JDC1ksSZXo
https://t.co/eqOW9XfjJT
@ColbyBanbury @Chert_Fu @lateinteraction Nice!
To be clear here btw, doing multi domain RL is a harder systems challenge as u deal with things like much bigger batch sizes
But the overall compute should be much smaller actually than experts then OPD
@alexjc @willdepue @georgejrjrjr They cited them on the idea of programmatic context discovery (actually the part everyone is saying is the most obv)
But if u see the blogs I linked, other aspects of RLMs can provide big gains too, and those are things Anthropic and Cursor blogged about and introduced months after RLMs
- No I meant tools like subagents, can see the blogs I linked too for examples but these are ideas in RLMs
- Well it’s not only the codebase but sure
Mainly getting at ppl not holding Cursor and Anthropic to the same standard when they release blogposts on this stuff without citing previous works (not even sure which ones there are other than RLMs)
Here referring to all the blogs I linked
@alexjc @willdepue @georgejrjrjr No but calling tools like subagents, and reading large contexts through the REPL, is a core idea here, not an interpretation
I don’t think this makes sense
RLM as an idea from my perspective is mainly the combination of subagents, PTC (programmatic tool calling), and programmatic context discovery
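A toy sketch of how those three pieces fit together, under my own assumptions. This is not the RLM reference implementation: `sub_llm` is a hypothetical stub for a subagent call, and `MODEL_WRITTEN_CODE` stands in for code the root model would emit.

```python
def sub_llm(prompt: str) -> str:
    """Hypothetical stub for a subagent call; a real one would query a model."""
    # Stub behavior: report whether the chunk mentions the keyword.
    return "yes" if "needle" in prompt else "no"

def run_in_repl(code: str, context: str) -> dict:
    """Run model-written code with the context and tools exposed as variables."""
    namespace = {"context": context, "sub_llm": sub_llm}
    exec(code, namespace)  # toy only: no sandboxing here
    return namespace

# Stand-in for what the root model writes: it never reads the full context
# itself, it slices it in code and fans chunks out to subagents.
MODEL_WRITTEN_CODE = '''
chunks = [context[i:i + 50] for i in range(0, len(context), 50)]
hits = [i for i, chunk in enumerate(chunks) if sub_llm(chunk) == "yes"]
answer = f"found in chunks {hits}"
'''

long_context = "filler " * 40 + "needle " + "filler " * 40
result = run_in_repl(MODEL_WRITTEN_CODE, long_context)
print(result["answer"])
```

The context lives in a variable (programmatic context discovery), the tool is called from code rather than one call per model turn (PTC), and the tool happens to be another model call (subagents). Fixed-size chunking could split a match across two chunks, so treat the slicing as illustrative only.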
If Claude starts saying how PTC is really good for something, it’s not invalid to say it’s more RLM-like, as it’s something an RLM could do and a key part of its performance
The GLM 5 case is a bit different as it’s not really separate expert models, since they were built on top of each other, but fair
Thought u were referring to nemotron 3
For OPD helping with forgetting, that's only the case for sequential stage RL, not multi domain RL
Also Nemotron Cascade 2 here is entirely a “skill issue” as they do fully on-policy training, so separating stages makes sense (no one in practice does fully on-policy tho)
To be clear too, I think OPD is a great part of the pipeline, I just think the multi domain expert RL then OPD recipe will mostly go away