Last fall, we shared our deep dive on FA4 internals.
But we didn't stop at grokking the kernel.
Since then, we've been developing improvements for inference performance and upstreaming them.
This blog post explains those contributions.
https://t.co/xzDNHdq3Zw
Tesla AI chip design engineering reviews are so great! Team is awesome.
Our AI6 chip might set a record for the most usable intelligence from a wafer when factoring in yield.
@MoraKing1788 you are welcome. here we pay rent annually, you pay like $3k to $4k for a two bedroom apartment in a major city annually.
there are downsides tho like a non-existent public transport system and garbage electricity supply.
Personal update: I’ve decided to leave OpenAI. Not that I ever worked there. But it just looks like everyone else is doing it, so I thought I'd hop on the bandwagon.
In other news, I've decided to join @AnthropicAI to work on AGI for the benefit of Claude. I don't think they realize that I've decided to join, and to be honest, I don't think my decision carries much weight with them, since I wasn't offered a job there.
But the decision stands.
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more.
Wrote the full transformer architecture and BPE tokenizer from scratch.
The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x higher throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform
Try out the model for yourself: https://t.co/TB2itlmCVT
Built with @_reesechong. Check out the repos and blog if you want to learn more.
Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke
cc @sundeep @GavinSherry
I have been trying, on and off, for more than 6 months to get up to speed with GPU/TPU kernel development.
IMHO, profiling should be the starting point of learning this topic. You profile, you question, you look for answers and in the process read and imbibe.
I set out on a journey to do just the same. I began profiling gemma4 and was quickly humbled by the amount of information that was at my disposal. The profiler table with huge GEMM names, the profiler trace with too many CPU rows.
To make my life easier, I stepped back and profiled a basic matrix multiplication and addition operation, the weights and bias interaction, as one might see it. The profiler artifacts were simple enough to reason about and think through.
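For anyone who wants to try that "step back" idea without a GPU, here is a minimal sketch. It is not the setup from the blog post: it assumes a CPU-only stand-in built on Python's standard `cProfile` and `pstats` modules, and the helper names (`matmul`, `add_bias`, `linear`, `profile_linear`) are hypothetical, chosen only for illustration. The point is just that a tiny matmul-plus-bias workload gives a profile table small enough to read line by line.

```python
import cProfile
import io
import pstats
import random


def matmul(a, b):
    """Naive matrix product of a (n x k) and b (k x m), both lists of rows."""
    cols = list(zip(*b))  # transpose b so each column is easy to iterate
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add_bias(m, bias):
    """Add the bias vector to every row of m."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def linear(x, w, bias):
    """y = x @ w + bias, i.e. the weights and bias interaction."""
    return add_bias(matmul(x, w), bias)


def profile_linear(n=64, k=64, m=64, seed=0):
    """Run linear() under cProfile and return (result, text report)."""
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(k)] for _ in range(n)]
    w = [[rng.random() for _ in range(m)] for _ in range(k)]
    bias = [rng.random() for _ in range(m)]

    profiler = cProfile.Profile()
    profiler.enable()
    y = linear(x, w, bias)
    profiler.disable()

    # Render the stats table into a string, sorted by cumulative time.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return y, out.getvalue()


if __name__ == "__main__":
    _, report = profile_linear()
    print(report)
```

The report has only a handful of rows (the matmul, the bias add, and their inner loops), which makes it easy to ask "where did the time go?" before moving on to a real GPU/TPU profiler.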
In this blog post, I document my journey and in the process uncover how one should profile and what one should look at! I hope this helps beginners (like me) with a starting point of their kernel development and optimization journey.
PS: This is a big blog post, bookmark it and come back to this when you have the time (good weekend read?)