@maharshii what is your workflow for writing kernels? i.e. if you wanted to write a GDN kernel and you didn't know anything about GDN, how would you start? and also what excites you about writing kernels?
I see a lot of enthusiasm about building sovereign models on my timeline.
That's great to hear and India needs it, BUT.. building a Fable-class model is a compute and funding game. Last I checked, India had ~50-100k H100 equivalents while frontier labs would have a million each.
Unless we have a paradigm shift in how AIs are trained, the conversation ought to be about the amount of funding available to do what we want to do. Show me an Indian company that's secured funding/compute in the same range as that of Chinese AI labs (let alone American labs).
Without compute, what will happen is what has happened before: we'd promise to shake the world and then build models that are a year or two behind the top ones.
The path forward for sovereign models that I see is to invest in basic R&D so we have a chance to go beyond the current paradigm, OR for the government to pool in several orders of magnitude more compute and seriously commit to competing at par.
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.
The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance.
Access to all other Claude models is not affected.
We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.
Read our full statement: https://t.co/bwn0sximKZ
People should get smarter at a rate sufficient to integrate their old experiences, but not so much smarter so fast that they can't integrate their new intelligence. Being smarter means you get bored faster, but you can also tackle new challenges you couldn't understand before.
"What I cannot create, I do not understand."
Introducing: The Feynman GPU Lectures.
Your H100s and B200s are running at a fraction of their peak utilization because your custom kernels are written with massive hardware bottlenecks. If you don't know what tcgen05.mma does at the wire level, you're lighting compute efficiency on fire.
Register files used to be the ultimate bottleneck for Tensor Core accumulators. Introducing Blackwell’s Tensor Memory (TMEM), a completely new address space inside the SM that isolates the accumulator entirely from the register file.
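To put rough numbers on the register-file point, here is a back-of-the-envelope sketch. It assumes a Hopper-style warpgroup MMA in which 128 threads share a 64xN FP32 accumulator tile, and a TMEM layout of 128 lanes by 512 columns of 32-bit cells per SM; the function names are made up for illustration and none of these figures come from the post itself.

```python
# Rough accumulator-footprint arithmetic (illustrative sketch, not a profiler).
# Assumptions: one warpgroup = 128 threads, FP32 accumulators (one 32-bit
# register per element), and a per-thread limit of 255 registers.

THREADS_PER_WARPGROUP = 128
MAX_REGS_PER_THREAD = 255


def accumulator_regs_per_thread(tile_m: int, tile_n: int,
                                threads: int = THREADS_PER_WARPGROUP) -> int:
    """32-bit registers each thread holds for a tile_m x tile_n FP32 accumulator."""
    return (tile_m * tile_n) // threads


def tmem_capacity_bytes(lanes: int = 128, columns: int = 512,
                        bytes_per_cell: int = 4) -> int:
    """Total TMEM bytes per SM under the assumed lanes x columns layout."""
    return lanes * columns * bytes_per_cell


if __name__ == "__main__":
    for tile_n in (64, 128, 256):
        regs = accumulator_regs_per_thread(64, tile_n)
        share = regs / MAX_REGS_PER_THREAD
        print(f"64x{tile_n} FP32 tile: {regs} regs/thread "
              f"({share:.0%} of the per-thread limit)")
    print(f"Assumed TMEM capacity: {tmem_capacity_bytes() // 1024} KiB per SM")
```

Under these assumptions a 64x256 tile already takes about half of each thread's register budget for accumulators alone, which is the pressure that moving accumulators into a separate address space is meant to relieve.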