apaz @apaz_cli - Twitter Profile

Pinned Tweet

3 months ago

Releasing mlsweep, a sweep scheduler and visualizer for distributed ML training. It aims to make launching runs across groups of GPUs frictionless and achieve near feature-parity with wandb. But you can use it with whatever frameworks or loggers you like, wandb included.

Photo: https://t.co/js8RXzqm2X

2

40

7

16

4K

apaz

@apaz_cli

about 3 hours ago

@rosinality I think that stress about high versus low rank is in some sense unfounded, seeing as many shifting low rank updates constitute a high rank update. Then again, this must also be validated: RL tends to just learn small subnetworks and does not do this at all. It's not like ReLoRA.

1

0

153
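
As a toy illustration of the point above that many shifting low rank updates can add up to a high rank update, the sketch below sums four rank-1 outer products and checks the rank of the result. The 4x4 size, the chosen vectors, and the helper names `outer`, `add`, and `matrix_rank` are assumptions made for this example. It is not ReLoRA and says nothing about what RL actually learns.

```python
def outer(u, v):
    """Rank-1 matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matrix_rank(mat, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    rows = [row[:] for row in mat]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank

dim = 4
# Update i touches row i only: u_i is a basis vector, v_i has ones from index i on.
updates = []
for i in range(dim):
    u = [1.0 if k == i else 0.0 for k in range(dim)]
    v = [1.0 if k >= i else 0.0 for k in range(dim)]
    updates.append(outer(u, v))

total = [[0.0] * dim for _ in range(dim)]
for update in updates:
    total = add(total, update)

update_ranks = [matrix_rank(update) for update in updates]
total_rank = matrix_rank(total)
```

Each entry of `update_ranks` is 1, while `total` is an upper-triangular matrix of ones with full rank. The rank only accumulates because each update points in a new direction; repeating the same rank-1 update four times would leave the sum at rank 1.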

apaz

@apaz_cli

1 day ago

@Norapom04 Blue line is a lot better than some others. Anything going south is worse. And some stations are much better than others. On the train you should be good. People do get mugged at blue stations sometimes. I haven't gotten in any fights, it's mostly been vagrants calling me slurs.

0

1

0

157

apaz

@apaz_cli

3 days ago

@_ueaj @TaliaRinger Importantly, no.

0

1

0

34

3 days ago

I love how the normalized length of any sufficiently large vector of normally distributed random variables converges to 1 by the law of large numbers. This lets you clip ES gradient projections more cheaply because you don't have to do a reduction to compute the length.

0

1

0

166
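
A minimal sketch of the idea in the tweet above, assuming the update has the form `coeff * z` with `z` drawn from a standard normal in `dim` dimensions. Under that assumption the norm of `z` is close to `sqrt(dim)`, so the clip factor can be computed from the scalar alone. The function names and the choice of `dim` are hypothetical and only for illustration.

```python
import math
import random

def sample_noise(dim, seed):
    """Draw z ~ N(0, I) as a plain list (stand-in for an ES perturbation)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def exact_norm(vec):
    """The full reduction that the approximation below avoids."""
    return math.sqrt(sum(x * x for x in vec))

def approx_clip_coeff(coeff, dim, max_norm):
    """Clip the update coeff * z using ||z|| ~= sqrt(dim), with no reduction over z."""
    approx_norm = abs(coeff) * math.sqrt(dim)
    if approx_norm <= max_norm:
        return coeff
    return coeff * (max_norm / approx_norm)

dim = 100_000
z = sample_noise(dim, seed=0)

# Normalized length ||z|| / sqrt(dim); concentrates around 1 as dim grows.
ratio = exact_norm(z) / math.sqrt(dim)

# Clip a large update to norm 1.0 without ever touching z.
clipped = approx_clip_coeff(5.0, dim, max_norm=1.0)
clipped_norm = abs(clipped) * exact_norm(z)
```

The spread of `ratio` around 1 shrinks roughly like `1 / sqrt(2 * dim)`, so the approximation is only reasonable when the perturbation is high-dimensional.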

apaz

@apaz_cli

4 days ago

@cs_serdar It's because I'm working on Zeroth Order Optimization (Evolution Strategies, SPSA, MeZO, etc) where there are no gradients and so no gradient summing or accumulation at all, so it's way easier and kinda just falls out of the math that it's possible to do at near zero cost.

0

1

0

24
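
For readers unfamiliar with the zeroth-order setting mentioned above, here is a small SPSA-style sketch on a toy quadratic. It assumes a two-point finite-difference estimate along a random sign vector; the loss function, step size, and names (`spsa_step`, `loss`) are illustrative choices, not the author's actual setup.

```python
import random

def loss(theta):
    """Toy objective: squared distance from the origin."""
    return sum(x * x for x in theta)

def spsa_step(theta, rng, lr=0.1, eps=1e-3):
    """One SPSA-style update using only two loss evaluations, no backprop."""
    z = [rng.choice((-1.0, 1.0)) for _ in theta]  # random sign perturbation
    plus = loss([t + eps * d for t, d in zip(theta, z)])
    minus = loss([t - eps * d for t, d in zip(theta, z)])
    # Scalar estimate of the directional derivative along z.
    coeff = (plus - minus) / (2 * eps)
    # The whole update is that one scalar times z.
    return [t - lr * coeff * d for t, d in zip(theta, z)]

rng = random.Random(0)
theta = [1.0, -2.0, 3.0]
initial_loss = loss(theta)
for _ in range(200):
    theta = spsa_step(theta, rng)
final_loss = loss(theta)
```

Because each item's contribution is a single scalar `coeff` multiplying a known direction `z`, per-item quantities such as the update magnitude are available without storing or summing full gradient vectors.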

apaz

@apaz_cli

7 days ago

Suppose in deep learning that we didn't do gradient accumulation/batching and we had full knowledge of what components of the grad came from what item. In that case would it make sense to apply grad clipping per-item? One preserves information, the other bounds step magnitude.

1

0

153
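
To make the two options in the question concrete, the sketch below compares clipping the averaged gradient against clipping each item's gradient before averaging, on a hand-picked batch with one outlier. The batch values, the threshold of 1.0, and the function names are assumptions for illustration only.

```python
import math

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def clip(vec, max_norm):
    """Rescale vec so its norm is at most max_norm."""
    n = norm(vec)
    if n <= max_norm:
        return list(vec)
    return [x * max_norm / n for x in vec]

def mean(vectors):
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]

def clip_batch(grads, max_norm):
    """Usual approach: average first, then clip the aggregate."""
    return clip(mean(grads), max_norm)

def clip_per_item(grads, max_norm):
    """Alternative: clip each item's gradient, then average."""
    return mean([clip(g, max_norm) for g in grads])

# Two well-behaved items and one outlier along the first axis.
per_item_grads = [[1.0, 0.0], [0.0, 1.0], [100.0, 0.0]]

batch_clipped = clip_batch(per_item_grads, max_norm=1.0)
item_clipped = clip_per_item(per_item_grads, max_norm=1.0)
```

Here `batch_clipped` is roughly `[1.0, 0.01]`, so the outlier nearly erases the second item's signal, while `item_clipped` is `[2/3, 1/3]`. With mean aggregation the per-item result also has norm at most `max_norm` by the triangle inequality, although that bound would not hold if the clipped gradients were summed instead.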

apaz

@apaz_cli

5 days ago

@tenderizzation @LLMenjoyer

1

3

0

48

apaz

@apaz_cli

5 days ago

@nyxkrage @stochasticchasm @opencode That is not nearly the scale of difference you'd expect. It's meant to fill up the entire context, basically.

1

0

19

apaz

@apaz_cli

6 days ago

PSA that despite the Deepseek V4 paper clearly stating that you must set a specific system prompt, @opencode does not set this prompt and so you cannot use "max" mode in opencode, even if you think you have selected max mode.

Photo: https://t.co/IgfZeW1voK

2

0

1

651

apaz

@apaz_cli

5 days ago

@stochasticchasm @opencode If so I can find no evidence of that being the case. And from my experimentation, no, it behaves way differently. You ask it a question and it pretty much fills up the whole context every time, you walk away and come back in 20 minutes. But that's not what it does in opencode.

2

1

0

62

apaz

@apaz_cli

5 days ago

@cs_serdar I think I understand better now. You are throwing some sort of signal away to avoid falling off the loss landscape. But is it better to throw the bad data point away because it's bad, or if it's not, should you be throwing data out uniformly with a full clip? That's my intuition.

1

0

14

apaz

@apaz_cli

5 days ago

@tenderizzation @snwy_me Honestly, as long as it's not batchnorm.

0

2

0

54

apaz

@apaz_cli

6 days ago

I think overall I like it, although the tone could probably use some work. Sometimes one of the strawmen makes you think "oh shit oh fuck you're so right I need to rethink everything thank you Claude what could I ever do without you" in a way that absolutely was not true of previous models. I expect they'll fix the tone in the next release, but I'm glad we have 4.8. It instantly started finding bugs, and I think it's going to massively help code maintainability going forward.

0

173

apaz

@apaz_cli

6 days ago

I understand not wanting to do SFT. There are real gains from not doing it. But the choice to pretrain without pretraining on things that look like rollouts is confusing to me. There is a big distribution mismatch there, as big as base model vs instruction tune. IMO it should be a part of midtraining. There are ways to produce data that looks like a rollout without actually sampling a rollout.

Rosinality @rosinality

6 days ago

Many choices here are only possible when your objective is not an immediate or short-term performance. Pretraining without synthetic data, posttraining without SFT with data from other LLMs. (And other good choices like scaling ladder with NLL instead of benchmark scores).

1

92

6

24

10K

0

1

0

1

214

apaz

@apaz_cli

6 days ago

I'm actually so thankful that RWKV exists. I can't name anyone else that has carried RNNs forward (mamba doesn't count)

0

2

0

68

apaz

@apaz_cli

7 days ago

I've been thinking about this, yeah. I think you're right. You're throwing away almost whole batches because some data in it happened to be OOD. Although maybe that's not the right framing? I think bounding step magnitude probably is independently useful, but in a world where you have clean batches it shouldn't be a problem. Anyway we don't have clean data. Is per-item clipping enough for stabilization? Or is bounding step magnitude what you want to do? Wondering how other people think about it, because the clankers have failed me. I'm in a paradigm where I have the ability to do this basically for free.

1

0

21

apaz

@apaz_cli

8 days ago

@vmfunc It's so difficult to explain to people my opinions about scientific accelerationism without first explaining plant alien erotica.

Photo: https://t.co/5LSkg6sc1a

1

4

0

147

apaz

@apaz_cli

8 days ago

You can tell Opus 4.8 is honest because it keeps telling you it's so honest.

1

7

0

194

apaz

@apaz_cli

9 days ago

I will hand it to Anthropic, the model released, I switched to it, and it instantly found a bug. It also instantly started annoying me. An adjustment for sure.

Taelin

@VictorTaelin

10 days ago

300k tokens trying to teach 4.8 how Bend's termination checker works 🫠 maybe not so bright, but somehow a pleasure to talk to and definitely my favorite model of all time

Photo: https://t.co/OMpqcotcSq

14

276

5

36

21K

0

7

0

471
