Philip Monk @pcmonk - Twitter Profile

3 months ago

@iamwil @dardezeu @patio11 It's a style that lets you say more things in a sentence without it devolving into a long series of comma-separated phrases. I think the terseness is a goal in itself though, and not necessarily specific to twitter.

0

1

0

40

Philip Monk

@pcmonk

3 months ago

@iamwil @patio11 the expectation was reasonable to him, though apparently not to Japanese salarymen

1

2

0

244

Philip Monk

@pcmonk

5 months ago

Even ignoring hw utilization, you still want to avoid routing collapse. But maybe that's more about enforcing minimum usage than max usage? If an expert is chosen <5% as often as the average, that expert is almost certainly wasted and undertrained. But if an expert is chosen 5x as often as the average, that could just mean it's a generically useful function. I think shared experts are a concession to this, but there could be more granularity than "perfectly balanced" vs "always activated"

1

0

67
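A rough sketch of the min-vs-max usage check described in the tweet above, assuming top-1 routing and a flat array of chosen expert ids. The function name `usage_report` and the toy counts are made up for illustration; the 5% and 5x thresholds are the ones from the tweet.

```python
import numpy as np

def usage_report(expert_ids, n_experts, low=0.05, high=5.0):
    """Compare each expert's token count to the average count per expert."""
    counts = np.bincount(expert_ids, minlength=n_experts)
    ratio = counts / counts.mean()  # 1.0 means exactly average usage
    return {
        "ratio": ratio,
        "starved": np.flatnonzero(ratio < low),  # likely wasted and undertrained
        "hot": np.flatnonzero(ratio > high),     # possibly just generically useful
    }

# Toy routing decisions for 8 experts over 1000 tokens (invented numbers).
toy_counts = np.array([700, 100, 100, 50, 30, 15, 5, 0])
toy_ids = np.repeat(np.arange(8), toy_counts)
report = usage_report(toy_ids, n_experts=8)
```

Only the "starved" list argues for intervention under this view; the "hot" list is the case where a hard balance constraint might be penalizing something useful.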

Philip Monk

@pcmonk

5 months ago

Re 2, it does seem like the variance loss should eventually load balance, but maybe that effect is not strong enough? Would be worth trying. Not the same, but this paper claims you can replace lb loss with an orthonormality loss on the router weights, and it still load balances. https://t.co/02se1meOZ3

The global-batch vs microbatch/sequence-wise lb loss distinction from the demons-in-the-detail paper you linked above seems important. With sequence-wise lb loss, you couldn't possibly get e.g. specialization by language, like they do in the last few layers, so I'm suspicious of a lot of the earlier papers that claim to get some kind of data-domain specialization while using microbatch-wise lb loss.

1

0

80
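A toy illustration of why the scope of the lb loss matters, assuming the Switch-style auxiliary loss N * sum_i f_i * P_i with top-1 routing. The papers mentioned above may use a different formulation; the function names and the two-sequence batch are invented for the example.

```python
import numpy as np

def lb_loss(probs):
    """Switch-style load-balancing loss over every token in `probs`.

    probs: router probabilities with shape (..., n_experts).
    """
    n_experts = probs.shape[-1]
    flat = probs.reshape(-1, n_experts)
    choice = flat.argmax(axis=1)  # top-1 expert per token
    f = np.bincount(choice, minlength=n_experts) / flat.shape[0]  # dispatch fraction
    p = flat.mean(axis=0)  # mean router probability per expert
    return float(n_experts * np.sum(f * p))

def sequence_wise_lb_loss(probs):
    """Average of the loss computed separately inside each sequence."""
    return float(np.mean([lb_loss(seq) for seq in probs]))

# Two sequences of 8 tokens, two experts. Each sequence (say, one language)
# routes entirely to its own expert.
probs = np.zeros((2, 8, 2))
probs[0, :, 0] = 1.0
probs[1, :, 1] = 1.0

global_loss = lb_loss(probs)                 # balance measured over the whole batch
per_seq_loss = sequence_wise_lb_loss(probs)  # balance measured inside each sequence
```

In this toy batch the global loss sits at its minimum of 1.0, because the two experts split the batch evenly, while the sequence-wise loss is 2.0, the worst case for two experts. Sequence-wise balancing pushes directly against per-sequence specialization.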


Philip Monk

@pcmonk

5 months ago

@_selebou @dmsobol @jahulas @maciejpioro @kuba_krj @Alibaba_Qwen @BytedanceTalk @SkyLi0n @ChengZhoujun @QuentinAnthon15 @xidulu @aman_gif @darkproger @essential_ai (unless you mean a measure on the output of the experts, I assumed you mean which experts got activated)

1

0

73

Philip Monk

@pcmonk

5 months ago

@_selebou @dmsobol @jahulas @maciejpioro @kuba_krj @Alibaba_Qwen @BytedanceTalk @SkyLi0n @ChengZhoujun @QuentinAnthon15 @xidulu @aman_gif @darkproger @essential_ai I also worry about using routing scores to measure specialization. I guess the times when you can draw these metrics are (am I missing any?):
- routing weights
- routing scores
- expert weights
- expert activations
- impact on predictive loss
I feel like later is better, usually

2

0

90

Philip Monk

@pcmonk

5 months ago

@dmsobol @jahulas @maciejpioro @kuba_krj @Alibaba_Qwen @BytedanceTalk @_selebou @SkyLi0n @ChengZhoujun @QuentinAnthon15 @xidulu @aman_gif @darkproger @essential_ai A good proxy would be useful, especially if you can add it directly to the loss. Orthogonality of either the weights or the activations feels good, but I haven't seen a conclusive connection to "true" specialization yet. This is the closest: https://t.co/ejTzOisSqe

3

1

0

159
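One way an orthogonality proxy on weights could be written as a scalar that is easy to add to a loss. This is only a sketch: the exact formulation in the linked papers is not shown here, and `orthogonality_penalty` is a hypothetical name. It measures how far the row-normalized weights are from being mutually orthogonal.

```python
import numpy as np

def orthogonality_penalty(w):
    """Sum of squared off-diagonal cosine similarities between rows of w.

    w: (n_experts, d) matrix, e.g. router weights or flattened expert weights.
    Returns 0.0 when all rows are mutually orthogonal.
    """
    unit = w / np.linalg.norm(w, axis=1, keepdims=True)
    gram = unit @ unit.T  # pairwise cosine similarities
    off_diag = gram - np.eye(w.shape[0])
    return float(np.sum(off_diag ** 2))

orthogonal_rows = np.eye(3, 5)  # three mutually orthogonal rows
duplicate_rows = np.array([[1.0, 2.0, 0.0], [1.0, 2.0, 0.0]])  # two identical rows

penalty_orthogonal = orthogonality_penalty(orthogonal_rows)
penalty_duplicate = orthogonality_penalty(duplicate_rows)
```

Whether a low value of a penalty like this tracks "true" specialization is exactly the open question in the tweet above.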

Philip Monk

@pcmonk

5 months ago

@dmsobol @jahulas @maciejpioro @kuba_krj @Alibaba_Qwen @BytedanceTalk @_selebou @SkyLi0n @ChengZhoujun @QuentinAnthon15 @xidulu @aman_gif @darkproger @essential_ai I don't feel like I have a good collection of metrics for expert specialization yet. What feels closest to the actual objective is masking out each expert and seeing how much loss increases, but this is expensive to compute.

2

1

0

145
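A minimal sketch of the mask-out-each-expert metric on a toy top-1 MoE layer in NumPy. Everything here (shapes, names, the synthetic data) is assumed for illustration and is not the author's setup. It also shows where the cost comes from: one extra evaluation pass per expert.

```python
import numpy as np

def moe_forward(x, router_w, expert_w, active):
    """Toy top-1 MoE: each token goes to its best-scoring active expert."""
    scores = x @ router_w.T  # (tokens, experts)
    scores = np.where(active, scores, -np.inf)  # masked experts can never win
    choice = scores.argmax(axis=1)
    # expert_w has shape (experts, d_in, d_out); apply the chosen expert per token.
    return np.einsum("td,tdo->to", x, expert_w[choice])

def ablation_deltas(x, y, router_w, expert_w):
    """Increase in mean squared error when each expert is masked out in turn."""
    n = router_w.shape[0]
    everyone = np.ones(n, dtype=bool)
    base = float(np.mean((moe_forward(x, router_w, expert_w, everyone) - y) ** 2))
    deltas = np.zeros(n)
    for i in range(n):  # one full evaluation pass per expert
        active = everyone.copy()
        active[i] = False
        pred = moe_forward(x, router_w, expert_w, active)
        deltas[i] = float(np.mean((pred - y) ** 2)) - base
    return base, deltas

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
router_w = np.zeros((3, 4))
router_w[0] = rng.normal(size=4)
router_w[1] = -router_w[0]  # experts 0 and 1 split the tokens; expert 2 never wins
expert_w = rng.normal(size=(3, 4, 2))
y = moe_forward(x, router_w, expert_w, np.ones(3, dtype=bool))  # targets from the full model

base, deltas = ablation_deltas(x, y, router_w, expert_w)
```

In this construction expert 2 is never selected, so masking it leaves the loss unchanged, while masking expert 0 or 1 reroutes its tokens and raises the loss.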

pcmonk retweeted

ollama

@ollama

6 months ago

.@essential_ai's rnj-1 model is now on Ollama!

ollama run rnj-1

8B parameter, open-weight dense model trained from scratch. The model is optimized for code and STEM with capabilities on par with other state of the art open-weight models.

Let's go! 🚀🚀🚀


9

241

34

70

31K

Philip Monk

@pcmonk

6 months ago

It's open weights and a very convenient size to run locally, btw. I get 20 tok/s on an M3 mac with llama.cpp.

0

8

0

433

Philip Monk

@pcmonk

6 months ago

It's been a blast to lead the infrastructure effort to train this model. I'm excited to see it out in the world!

Ashish Vaswani

@ashVaswani

6 months ago

We are beyond thrilled to share our first flagship models, Rnj-1 base and instruct 8B parameter models. Rnj-1 is the culmination of 10 months of hard work by a phenomenal team, dedicated to advancing American SOTA OSS AI. Lots of wins with Rnj-1.
1. SWE bench performance close to GPT 4o.
2. Tool use outperforming all comparable open source models.
3. Mathematical reasoning (AIME’25) nearly at par with GPT OSS MoE 20B.
….

103

2K

166

609

630K

2

16

0

818

pcmonk retweeted

Essential AI

@essential_ai

6 months ago

Today, we’re excited to introduce Rnj-1, @essential_ai's first open model; a world-class 8B base + instruct pair, built with scientific rigor, intentional design, and a belief that the advancement and equitable distribution of AI depend on building in the open. We bring American open-source at par with the best in the world.


35

1K

153

442

606K

Philip Monk

@pcmonk

6 months ago

@joji_teira I wasn't around for Wang, so I just used a trillion flops to save a trip to wikipedia

0

1

0

18

Philip Monk

@pcmonk

6 months ago

You all have it so easy today with your petaflop gpus. In my day we had *floppy disks* that could only handle a few hundred kiloflops/s

1

0

190

Philip Monk

@pcmonk

7 months ago

It finally happened: I ran into a bug that rust would have caught

Philip Monk

@pcmonk

10 months ago

The things that are hard about ml infra are not things that rust solves

3

8

0

1

1K

0

4

0

1

279

pcmonk retweeted

Essential AI

@essential_ai

9 months ago

[1/2] We at Essential are driven by a mission to advance fundamental research guided by first principles, rigor and sharing research openly.

1

30

10

2

5K

Philip Monk

@pcmonk

10 months ago

@hastuc_dibtux I've not looked into those much. It seems a very tall order, and my first question would be what's their story for flash attention

0

1

0

71

Philip Monk

@pcmonk

10 months ago

The things that are hard about ml infra are not things that rust solves

mattparlmer 🪐 🌷

@mattparlmer

10 months ago

The fact that Python is the standard for machine learning is a serious indictment of the field’s engineering standards

758

14K

641

3K

2M

3

8

0

1

1K

Philip Monk

@pcmonk

10 months ago

@hastuc_dibtux I wish I knew, because I'm not satisfied with what I'm doing (juggling ymls overriding parts of files like this: https://t.co/euBCzA38or)

0

1

0

73
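For readers unfamiliar with the pattern, "ymls overriding parts of files" usually comes down to a recursive merge of nested mappings. Below is a generic sketch using plain dicts in place of parsed YAML. It is not the setup behind the link above, and `deep_merge` and the config keys are hypothetical.

```python
from copy import deepcopy

def deep_merge(base, override):
    """Return a copy of `base` with `override` applied key by key.

    Nested dicts are merged recursively; any other value is replaced wholesale.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

base_cfg = {"model": {"layers": 32, "hidden": 4096}, "optim": {"lr": 3e-4}}
experiment_cfg = {"model": {"layers": 16}, "optim": {"lr": 1e-4, "warmup": 2000}}
final_cfg = deep_merge(base_cfg, experiment_cfg)
```

The juggling starts once several override layers stack up and it is no longer obvious which file set a given value.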

Philip Monk

@pcmonk

10 months ago

@hastuc_dibtux Flash at its core is "just" fusion, but it requires a lot more knowledge than just the shapes, which is what compilers usually get

1

0

101
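To make the "more knowledge than just the shapes" point concrete, here is a NumPy sketch of the online-softmax rescaling that lets attention be computed one key/value tile at a time without materializing the full score matrix. It is a single-head, unmasked toy for illustration, not the actual FlashAttention kernel, and the function names are made up.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: materialize the full (n_q, n_k) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    """Process keys/values in tiles, keeping running softmax statistics."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)           # running row-wise max of the scores
    l = np.zeros(n)                   # running softmax denominator
    acc = np.zeros((n, v.shape[1]))   # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale
        m_new = np.maximum(m, s.max(axis=1))
        correction = np.exp(m - m_new)  # rescales everything accumulated so far
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 8))
max_abs_diff = float(np.abs(naive_attention(q, k, v) - tiled_attention(q, k, v)).max())
```

The `correction` factor relies on an algebraic identity of softmax rather than on tensor shapes, which is the kind of knowledge a shape-driven fusion pass typically does not have.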

Philip Monk

@pcmonk

10 months ago

@hastuc_dibtux The performance difference between flash v2 and v3 is like double, and that's just adding like ping pong scheduling. It's a deep rabbit hole and also completely non-optional for anything at scale.

1

0

108

