serdarml @cs_serdar - Twitter Profile

When someone says "we need theory of deep learning", note that probably nothing will "count" as a theory of deep learning unless it is *their* theory, or unless it is speaking a language, and using techniques, that they already have a bias towards.

12

196

12

33

11K

serdarml

@cs_serdar

1 day ago

@anshuc @AlexiGlad

0

1

0

206

serdarml

@cs_serdar

1 day ago

@yacineMTB R u having visa issues?

0

134

cs_serdar retweeted

Tilde

@tilderesearch

1 day ago

https://t.co/rmTk8GMkir

7

351

40

339

77K

serdarml

@cs_serdar

1 day ago

@tilderesearch Damn u guys are cooking

0

341

serdarml

@cs_serdar

1 day ago

@HeMuyu0327 So this is why value embeddings/projections work so well in nanochat and adjacent repos.

0

2

0

204

serdarml

@cs_serdar

1 day ago

@yacineMTB @ThePrimeagen Have u tried open sourcing ur drone stuff

0

1

0

149

serdarml

@cs_serdar

2 days ago

@splitbycomma I'm confused abt the word striver, does it mean larper or somebody actually competent and lacking credentials?

1

0

46

serdarml

@cs_serdar

2 days ago

Why don't we train LLMs with Q heads? Training a head to predict the model's loss during pre-training could be useful later, and even if not, it could be a good signal to propagate back, like MTP. Is there a reason we don't do this?

0

1

51
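A minimal sketch of the loss-prediction head idea from the post above, assuming a mean-squared-error auxiliary term added to the language-modeling loss. The names `combined_loss` and `aux_weight`, and the choice of MSE, are illustrative assumptions and not something the post specifies. In a real framework the per-token LM loss used as the target would be detached (stop-gradient) so the head chases the loss rather than the other way around.

```python
def combined_loss(token_losses, predicted_losses, aux_weight=0.1):
    """Mean LM loss plus a weighted MSE term for a loss-prediction head.

    token_losses:     per-token LM losses (treated as constant targets)
    predicted_losses: the auxiliary head's guess at each per-token loss
    aux_weight:       hypothetical weighting of the auxiliary term
    """
    if len(token_losses) != len(predicted_losses):
        raise ValueError("need one prediction per token loss")
    n = len(token_losses)
    lm_loss = sum(token_losses) / n
    # Squared error between the head's prediction and the actual loss.
    aux_loss = sum((p - t) ** 2 for p, t in zip(predicted_losses, token_losses)) / n
    return lm_loss + aux_weight * aux_loss, lm_loss, aux_loss


total, lm, aux = combined_loss([2.0, 1.0], [3.0, 1.0])
```

With a perfect head the auxiliary term vanishes and the total equals the plain LM loss, so the extra head only contributes gradient where its predictions are off.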

serdarml

@cs_serdar

2 days ago

@itsreallyvivek Huh, i guess my work at SAP may be more relevant than i thought regarding the second point

0

353

serdarml

@cs_serdar

3 days ago

The way I think about it (which may be totally wrong) is that clean data does not give crazy grad norms. Grad clipping is more for really bad batches that somehow slipped in, like batches of just repeating tokens and other artifacts of web data. So it’s there for “emergencies”, and if your data is clean it won’t do anything at all. But this may be wrong; an easy way to dive deeper into this would be to just manually examine which batches trigger the grad clipping.

1

0

12
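The "examine which batches trigger the grad clipping" suggestion in the post above could be sketched roughly like this. It is a toy, framework-free illustration: gradients are plain lists of floats, and the names `clip_and_log`, `max_norm`, and the batch ids are hypothetical. In an actual training loop the same bookkeeping would sit around whatever global-norm clipping call the framework provides.

```python
import math


def global_grad_norm(grads):
    """L2 norm of all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_and_log(grads, max_norm, batch_id, triggered):
    """Clip grads to max_norm and record the batch if clipping fired."""
    norm = global_grad_norm(grads)
    if norm > max_norm:
        # Remember which batch needed clipping, for manual inspection later.
        triggered.append((batch_id, norm))
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads


# Toy gradients: two well-behaved batches and one pathological one.
batches = {
    "clean_0": [0.3, -0.4],
    "clean_1": [0.6, 0.0],
    "repeated_tokens": [30.0, -40.0],
}

triggered = []
clipped = {
    batch_id: clip_and_log(grads, 1.0, batch_id, triggered)
    for batch_id, grads in batches.items()
}
```

After a run, `triggered` holds the ids and pre-clip norms of the batches that fired, which is the list one would then go and read by hand.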

serdarml

@cs_serdar

3 days ago

@luisgnet @creatine_cycle Drugs maybe?

1

0

36

serdarml

@cs_serdar

3 days ago

@kalomaze Hm i understand now, thanks!

0

1

0

58

serdarml

@cs_serdar

3 days ago

Interesting, the whole multiple specialized experts then OPD thing sounded overengineered when I first read it, I'll definitely take a deeper look now. It seems to run counter to the idea that generalized models can transfer skills between tasks, and that there is potentially a general but sparse “high iq” signal shared between tasks. Seems not AGI pilled, if that makes sense.

1

0

92

serdarml

@cs_serdar

3 days ago

@kalomaze Are you suggesting this assumption is holding back algo design, or that we should design different tasks that have denser/continuous reward signals?

1

3

0

132

serdarml

@cs_serdar

3 days ago

@0xAX For me it was good to get some exposure to every layer of the stack even if not all together in one project. It'd be good to do mini projects for the lower layers even if you build on a higher level for your main projects.

1

2

0

335
