Hey all, I will be at GTC next week talking about all the work my team and I did on large-scale MoE training in JAX on GPUs! We decided early on to have a fully dropless training stack to avoid token dropping. (1/7)
I am hiring highly skilled performance engineers for my team! You will be working on optimising pretraining for models >100B params on O(1000s) of GPUs, and hardware-aligned architecture design. We are cooking a lot of very exciting projects and I can safely say you will have a lot of fun! Link in thread. <3
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces.
So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel.
Megakernels are faster & more humane. Here’s how to treat your Llamas ethically:
(Joint with @jordanjuravsky, @stuart_sul, @OwenDugan, @dylan__lim, @realDanFu, @simran_s_arora, and @HazyResearch)
@wordgrammer Hmm, I think it’s mostly that Python is easy to use as the frontend of your DSL. Python calling out to native code is really simple because of CPython and its simple, relatively unassuming runtime. For these reasons I’d love to see more Lisp, as it does better on those aspects.
@andersonbcdefg For some absurd reason 95% of programmers only go for frameworks and are scared of anything below.
They somehow don't see that frameworks usually enshittify the whole thing!
@andersonbcdefg Call it shiny toy syndrome. This is also a general pattern I find with hf code. It looks simple at first glance, but try to do anything serious with it and the spaghetti code hits you.
@ludwigABAP Synergies between model and CLI will probably become the differentiating factor. Claude 3.7 supposedly is already better in Claude CLI than in Cursor, and that gap will only widen.
@abacaj Although I agree Gemini is in a league of its own right now, people could switch away from Gemini just as easily as they switched to it. But Google’s massive infra lead is what I’d be afraid of if I were one of its competitors, now that they understand how to build SOTA models.
I think AI is going to usher in a golden age of infra, not obviate it.
It's just so clear that good CS fundamentals result in better AI-built systems. Vibe coding works better with type safety, languages where syntax maps closely to semantics, referential transparency, tight scoping, etc.
These approaches have never been widely adopted in CS because they are hard for humans, and in particular novices. But they're not hard for AI. And they map so much better from natural language descriptions.