catherine @cat_eye_on - Twitter Profile

catherine

@cat_eye_on

3 days ago

@PatrickToulme @icanvardar ok u cooked with this

0

22

cat_eye_on retweeted

Yifan Wu

@yifannnwu

7 days ago

Been thinking about this for a while, as tasks go to more and more turns and longer horizons, PPO is much more elegant for giving dense, per-turn reward. And here we go.

0

67

3

34

9K

catherine

@cat_eye_on

7 days ago

@Stone_Tao @physical_int congrats Pebble Tao!

0

1

0

153

catherine

@cat_eye_on

12 days ago

@neprodian What have you been doing to squeeze perf out of a single thread?

0

73

catherine

@cat_eye_on

13 days ago

@Stone_Tao @jsuarez 👀

0

1

0

30

catherine

@cat_eye_on

13 days ago

@creet_z @m_sirovatka DUDE HAHAHA

0

1

0

30

catherine

@cat_eye_on

13 days ago

@theyangward Not super familiar with MARL, but does it make sense to have one policy per agent and an “orchestrator” policy

1

0

39

catherine

@cat_eye_on

13 days ago

@theyangward Why doesn’t it scale?

1

0

26

catherine

@cat_eye_on

14 days ago

@sheriyuo when did we start distilling dLLMs? first i’m hearing of this

0

44

catherine

@cat_eye_on

15 days ago

@chenwanch1 @AIatMeta Congrats!

0

119

catherine

@cat_eye_on

17 days ago

@JustinLin610 worded my thoughts perfectly

0

181

catherine

@cat_eye_on

17 days ago

@yacinelearning the way this is how i learned what a load bearing token is

0

2

0

22

catherine

@cat_eye_on

17 days ago

@himanshustwts sounds kinda interesting no?

0

161

catherine

@cat_eye_on

18 days ago

@fujikanaeda whos your art guy 👀 @creet_z

1

0

154

catherine

@cat_eye_on

18 days ago

@bilaltwovec Truly, I'm so excited to read the paper

0

1

0

116

catherine

@cat_eye_on

18 days ago

@shreybirmiwal doesn’t need to be a competition, both are pretty cool

0

4

0

1K

catherine

@cat_eye_on

18 days ago

@shatayumk sparse rewards

0

1

0

24

catherine

@cat_eye_on

19 days ago

@charles_irl @modal was gonna say you all are super quick with it but I guess you probably had access to this in advance haha

0

1

0

111

cat_eye_on retweeted

wh

@nrehiew_

19 days ago

This paper prompted me to do a review of NVFP4 pre-training, given that NVIDIA seems to be pushing support for it especially on Blackwells. Much of the content will come from "Pretraining Large Language Models with NVFP4" and the Nemotron 3 Super paper 🧵

3

88

8

86

43K

cat_eye_on retweeted

wh

@nrehiew_

20 days ago

The main highlight is that NVIDIA did NVFP4 pretraining. Much of the recipe follows previous Nemotron work:
- Hadamard Transforms applied to weight gradient computation to reduce the impact of outliers.
- Some layers kept at higher precision. (Table from Nemotron 3 Super). Specifically, final layers tend to require more dynamic range and mantissa than FP4 provides.
- Stochastic rounding rather than deterministic rounding to prevent bias, specifically in the gradients.

To validate the recipe, they train smaller models up to 16T and show a mere ~0.4% relative train loss gap with the bf16 baseline.

See more: https://t.co/XwBLFEg7QV

(we discuss other stability issues in later sections)
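
To make the stochastic rounding point concrete, here is a rough sketch (not the NVIDIA recipe itself) of why it avoids bias. It assumes the 4-bit E2M1 value grid and skips NVFP4's per-block scaling entirely; the names `round_nearest` and `round_stochastic` are made up for illustration.

```python
import random
from bisect import bisect_right

# Magnitudes representable by a 4-bit E2M1 float (assumed element format).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Full signed grid, ascending: -6.0 ... 0.0 ... 6.0
GRID = [-m for m in reversed(E2M1_MAGNITUDES[1:])] + E2M1_MAGNITUDES


def clamp(x):
    """Saturate x to the representable range of the grid."""
    return min(max(x, GRID[0]), GRID[-1])


def round_nearest(x):
    """Deterministic rounding: always pick the closest grid point."""
    x = clamp(x)
    return min(GRID, key=lambda g: abs(g - x))


def round_stochastic(x, rng):
    """Round up or down at random, with probability proportional to
    proximity, so the expected result equals the (clamped) input."""
    x = clamp(x)
    hi_idx = bisect_right(GRID, x)  # first grid point strictly above x
    if hi_idx == len(GRID):         # x sits on the top grid point
        return GRID[-1]
    lo, hi = GRID[hi_idx - 1], GRID[hi_idx]
    p_up = (x - lo) / (hi - lo)     # 0 when x is exactly on the grid
    return hi if rng.random() < p_up else lo


def mean_after_rounding(x, n, rng):
    """Average of n independent stochastic roundings of x."""
    return sum(round_stochastic(x, rng) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    small_grad = 0.2  # a value that falls between grid points 0.0 and 0.5
    print("nearest:", round_nearest(small_grad))
    print("stochastic mean:", mean_after_rounding(small_grad, 20000, rng))
```

With nearest rounding, a small gradient like 0.2 collapses to the same grid point every time, so the error never averages out. With stochastic rounding the individual values are noisier, but their average stays close to the input, which is the "prevent bias" argument in the list above.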

3

23

2

5

3K
