Grad

@Grad62304977

Joined October 2020

about 16 hours ago

ya i like this, initially i used to think deepseek r1 was using way bigger group sizes and batch sizes than it was. Also this speaks to the beauty of RL: I can just say fuck it and double my group size and get reliably better performance and more compute spent. Can't really do the same in pretraining.
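A minimal sketch of why group size is such a direct compute knob, assuming a GRPO-style setup where each prompt is sampled `group_size` times and rewards are normalized within the group. The function names, the epsilon, and the example numbers are illustrative assumptions, not values from DeepSeek R1 or any other run.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group's mean and std.

    Assumption: GRPO-style normalization; eps only guards against a zero std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rollout_tokens(num_prompts, group_size, mean_tokens_per_rollout):
    """Rough number of generated tokens in one RL batch."""
    return num_prompts * group_size * mean_tokens_per_rollout


# Hypothetical numbers, chosen only to show the scaling.
base = rollout_tokens(num_prompts=512, group_size=16, mean_tokens_per_rollout=4096)
doubled = rollout_tokens(num_prompts=512, group_size=32, mean_tokens_per_rollout=4096)
print(base, doubled)

# One group of four rollouts with binary rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Doubling `group_size` doubles the rollout tokens per step while the prompt set stays fixed, which is the sense in which the knob is easy to turn.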

about 23 hours ago

Interestingly didn’t see anyone talking abt this but MAI used a batch size of almost 1B tokens during their final RL stage

Photo: https://t.co/o5yjaCAvtJ

about 16 hours ago

@willccbb ya of course was just saying i dont think many ppl actually realise how big the batches can become

about 22 hours ago

@SeunghyunSEO7 nah for RL this is good. Under-talked abt aspect where u could genuinely go to 2B token batch sizes or more and it's still better. Pretraining batch size tbh was pretty big esp with adamW, some interesting adamW hparams although maybe related to large batch sizes?

about 23 hours ago

@ar0cket1 That’s fine. But no, the thing with RL is that so far bigger batch sizes are more stable and give better performance.

about 23 hours ago

In general RL seems to have a very different batch size scaling than pretraining. Also even in their previous shorter context stages, the batch sizes were always bigger than or equal to the pretraining batch size.

about 24 hours ago

@thepushkarp Photo: https://t.co/gSqUrksU5i

Grad62304977 retweeted

Matej Sirovatka

3 days ago

KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀

Grad62304977 retweeted

4 days ago

ptc is the way

5 days ago

@ColbyBanbury @Chert_Fu @lateinteraction Ya the organisational thing does steer in favor of expert RL then OPD. Although I do imagine that after setting up an initial mixed domain RL run, it’s easier to then mix stuff into the run from each team separately.

5 days ago

@willdepue @georgejrjrjr Also literally can’t say this stuff is trivial when from my understanding Claude code, codex, cursor… didn’t support PTC or dynamic context discovery (for stuff other than the codebase) until after RLMs (or still don’t support it)

6 days ago

Tbf a significant part of RLM is programmatic tool calling, which Anthropic had blogs on months after the initial RLM blog with pretty big gains (I think they use it for certain evals too). Also things like tool search, which many adopted and made blogs on for their importance, are part of the RLM idea (iirc these were after the initial RLM blog in Oct 2025). Also most of the cursor dynamic context discovery blog, done around 2 months after the RLM blog, are ideas RLM would have. https://t.co/MflVrGGPlx https://t.co/JDC1ksSZXo https://t.co/eqOW9XfjJT

5 days ago

@ColbyBanbury @Chert_Fu @lateinteraction Nice! To be clear here btw, doing multi domain RL is a harder systems challenge as u deal with things like much bigger batch sizes. But the overall compute should be much smaller actually than experts then OPD.

6 days ago

@alexjc @willdepue @georgejrjrjr They cited them on the idea of programmatic context discovery (actually the part everyone is saying is the most obv). But if u see the blogs I linked, other aspects from RLM can provide big gains, and these are things that Anthropic and cursor made blogs on and introduced months after RLMs.

6 days ago

- No I meant tools like subagents, can see the blogs I linked too for examples but these are ideas in RLMs
- Well it’s not only the codebase but sure

Mainly getting at ppl not holding cursor and Anthropic to the same standard as they release blogposts on this stuff without citing previous works (not even sure which ones there are other than RLMs). Here referring to all the blogs I linked.

6 days ago

@alexjc @willdepue @georgejrjrjr No but calling tools like subagents, and reading large contexts through the REPL is a core idea here not an interpretation

6 days ago

I don’t think this makes sense. RLM as an idea from my perspective is mainly the combination of subagents, PTC (programmatic tool calling), and programmatic context discovery. If Claude starts saying how PTC is really good for something, it’s not invalid to say it’s more RLM like, as it’s something an RLM could do and a key part of its performance.
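A toy sketch of the PTC part of that combination, assuming PTC means the model writes a program that calls tools itself, so only the final result comes back into its context. The tools, the fake file contents, `MODEL_PROGRAM`, and `run_program` are all hypothetical stand-ins, and `exec` here has no sandboxing.

```python
# Hypothetical tools; a real system would call external services.
FAKE_FS = {
    "a.txt": "alpha\nTODO: fix parser",
    "b.txt": "beta",
    "c.txt": "TODO: add tests\ngamma",
}


def list_files():
    return sorted(FAKE_FS)


def read_file(name):
    return FAKE_FS[name]


TOOLS = {"list_files": list_files, "read_file": read_file}

# Stand-in for code the model would emit. It loops over tool calls and
# filters the output itself instead of issuing one tool call per turn.
MODEL_PROGRAM = """
todos = []
for name in list_files():
    for line in read_file(name).splitlines():
        if line.startswith("TODO"):
            todos.append((name, line))
result = todos
"""


def run_program(source, tools):
    """Run model-written code with the tools in scope and return `result`."""
    namespace = dict(tools)
    exec(source, namespace)  # assumption: a real harness would sandbox this
    return namespace.get("result")


print(run_program(MODEL_PROGRAM, TOOLS))
```

The full file contents never enter the model's context here, only the filtered `result` does.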

6 days ago

The GLM 5 case is a bit different as it’s not really separate expert models as they were built on top of each other, but fair. Thought u were referring to nemotron 3. For OPD helping with forgetting, that’s only the case for sequential stage RL, not multi domain RL. Also nemotron cascade 2 here is entirely a “skill issue” as they do fully on policy training so separating stages makes sense (no one in practice does fully on policy tho). To be clear too, I think OPD is a great part of the pipeline, I just think the multi domain expert RL then OPD will mostly go away.

6 days ago

@ColbyBanbury @Chert_Fu @lateinteraction And idk what qwen3.5 does, do u have a source for where they say that?

6 days ago

@ColbyBanbury @Chert_Fu @lateinteraction Main example is minimax and not sure which one kimi does. Nemotron doesn’t do this. GLM 5 doesn’t do this either.
