When someone says "we need a theory of deep learning", note that probably nothing will "count" as a theory of deep learning unless it is *their* theory, or unless it speaks a language, and uses techniques, that they are already biased towards.
Why don't we train LLMs with Q heads? Training a head to predict the model's loss during pre-training could be useful later, and even if not, it could be a good signal to propagate back, like MTP. Is there a reason we don't do this?
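To make the question concrete, here is one possible reading of it as a toy sketch, not an established recipe: a scalar head reads the hidden state and regresses onto the token's own cross-entropy, and that regression error is added to the LM loss as an auxiliary term. The names `loss_head`, `combined_loss` and `aux_coef` are made up for illustration, and treating the LM loss as a fixed target (stop-gradient in a real autograd framework) is an assumption.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy of one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def loss_head(hidden, weights, bias):
    """Hypothetical scalar head: a linear readout of the hidden state."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias

def combined_loss(hidden, logits, target, weights, bias, aux_coef=0.1):
    """LM loss plus an auxiliary squared error for predicting that loss.

    The LM loss is used as a plain number here, i.e. as a fixed regression
    target; whether to stop gradients through it is an assumption.
    """
    lm = token_cross_entropy(logits, target)
    predicted = loss_head(hidden, weights, bias)
    aux = (predicted - lm) ** 2
    return lm + aux_coef * aux, lm, predicted

# Toy token: uniform logits over 2 classes, so the LM loss is log(2).
total, lm, predicted = combined_loss(
    hidden=[1.0, 2.0], logits=[0.0, 0.0], target=0,
    weights=[0.0, 0.0], bias=0.0,
)
print(total, lm, predicted)
```

Whether the extra gradient actually helps the trunk, the way MTP is claimed to, is exactly the open question; the sketch only shows where the term would plug in.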
The way I think about it (which may be totally wrong): clean data does not give crazy grad norms. Grad clipping is more for really bad batches that somehow slipped in, like just repeating tokens and other artifacts of web data. So it’s there for “emergencies”, and if your data is clean it won’t do anything at all. But this may be wrong; an easy way to dive deeper into this would be to just manually examine which batches trigger the grad clipping.
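A minimal sketch of that check, in plain Python with gradients faked as nested lists: compute the global grad norm, clip if it exceeds the threshold, and record the step so the batch can be pulled up and read afterwards. `clip_and_record`, `triggered` and the toy numbers are all made up for illustration; in a real training loop you would log the pre-clip norm your framework already reports.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_and_record(grads, max_norm, step, triggered):
    """Scale grads so their global norm is at most max_norm.

    If clipping fires, append (step, pre-clip norm) to `triggered` so the
    offending batch can be looked up and inspected later.
    """
    norm = global_grad_norm(grads)
    if norm <= max_norm:
        return grads  # clean batch: clipping is a no-op
    triggered.append((step, norm))
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads]

# Toy run: step 1 stands in for a "bad" batch with a huge gradient.
fake_grads_per_step = [
    [[0.5, 0.0], [0.5]],      # small norm, untouched
    [[30.0, -40.0], [0.0]],   # norm 50.0, gets clipped
    [[0.25, 0.0], [0.25]],    # small norm, untouched
]
triggered = []
clipped = [
    clip_and_record(g, max_norm=1.0, step=i, triggered=triggered)
    for i, g in enumerate(fake_grads_per_step)
]
print(triggered)
```

If the hypothesis holds, the steps that show up in `triggered` should map back to visibly junky batches (repeated tokens, scraped boilerplate); if they look like ordinary text, the "emergencies only" picture is probably wrong.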
Interesting, the whole "multiple specialized experts, then OPD" thing sounded overengineered when I first read it. I'll definitely take a deeper look now.
Seems to run counter to the idea that generalized models can transfer skills between tasks, and that there might be a general but sparse “high iq” signal shared between tasks. Seems not AGI pilled, if that makes sense.
@kalomaze Are you suggesting this assumption is holding back algo design, or that we should design different tasks that have denser/continuous reward signals?
@0xAX For me it was good to get some exposure to every layer of the stack even if not all together in one project. It'd be good to do mini projects for the lower layers even if you build on a higher level for your main projects.