We taught a brand-new mini-series this year at @SCSatCMU on Modern GPU Programming for ML Systems, as part of the ML Systems course, touching on fun questions like what data layout swizzling is, how to use 3D TMA, and state-of-the-art Blackwell programming. We released a curated online book based on the materials. Check it out: https://t.co/5ZJg2lySNO
A bit of news: After nearly 9 years, I have decided to leave Google DeepMind and join Anthropic (after taking some time to recharge). I am incredibly grateful for my time at GDM. @demishassabis took a real chance letting me lead the AlphaFold team just six months after finishing my PhD, and the entire GDM team taught me so much about how to do great science. GDM is a special place, and I’ll still be excited to hear about what amazing things they discover next.
Here is the technical report on SubQ 1.1 Small.
https://t.co/bu8AEc4lsk
This is the second iteration on our Subquadratic Sparse Attention (SSA) model, and the first to be deployed with design partners in the coming weeks.
The results are compelling and verified by @AppenResearch.
- Near-perfect long-context retrieval up to 12M tokens on the needle-in-a-haystack test, with up to nearly 1,000x attention compute reduction.
- A balance of long-context optimization and general reasoning ability, with strong performance retained across knowledge, coding, and non-coding enterprise agent benchmarks.
- At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
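A back-of-the-envelope sketch of why a sparse scheme's advantage over dense attention grows with context length. It assumes, purely for illustration, that every query attends to a fixed budget of keys; `keys_per_query`, `head_dim`, and the function names are hypothetical, and the actual SSA mechanism and any key-selection overhead are not modeled here.

```python
def dense_attention_flops(n_tokens: int, head_dim: int) -> int:
    """Approximate FLOPs for QK^T plus the weighted sum over V (one head, one layer)."""
    # Each of the two matmuls costs about 2 * n * n * d FLOPs.
    return 4 * n_tokens * n_tokens * head_dim


def sparse_attention_flops(n_tokens: int, keys_per_query: int, head_dim: int) -> int:
    """Same two matmuls, but each query only scores keys_per_query keys (assumption)."""
    return 4 * n_tokens * keys_per_query * head_dim


def reduction_factor(n_tokens: int, keys_per_query: int, head_dim: int = 128) -> float:
    """Dense FLOPs divided by sparse FLOPs under the fixed-budget assumption."""
    dense = dense_attention_flops(n_tokens, head_dim)
    sparse = sparse_attention_flops(n_tokens, keys_per_query, head_dim)
    return dense / sparse
```

Under these assumptions the ratio is simply `n_tokens / keys_per_query`, so `reduction_factor(1_000_000, 16_000)` gives 62.5 and doubling the context doubles the saving.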
These results highlight a significant scaling advantage thanks to the efficiency gains from the SSA architecture.
We included some details and learnings from the development process which may be helpful to the community.
Comment with questions and I’ll try to respond!
We're thrilled to announce that we have raised $234M in the first close of our $300M Series B at a $1.5B valuation.
@HCLTech and @BessemerVP have joined us in this round, alongside continued support from @khoslaventures and @peakxvpartners
For countries and companies, sovereign control over the AI stack is no longer optional. Sarvam will be the partner of choice for this aspiration. The capital allows us to accelerate our momentum towards this full stack of models, compute, and deployments.
A huge thank you to our customers, partners, investors, and the Sarvam team for your trust and belief in what we are building. We’re just getting started.
Read more: https://t.co/VmLtpnj8gx
Fable 5 is out now
but before that, it had its model card updated:
https://t.co/JVuxGZRFdN
The doc's changelog is mostly accurate this time, but it fails to mention, e.g., the removal of this footnote:
"This threshold maps to the High-stakes sabotage opportunities threat model in our current Responsible Scaling Policy."
I do not want to do AI research that is reactive to what these companies are doing, or even what they're saying.
The entire field keeps chasing after product releases. Some spend more time reading marketing copy than their colleagues' papers and I just... do not want to do that?
one of the quotes i find most inspiring on a hard day:
"Whatever your hand finds to do, do it with all your might, for in the realm of the dead, where you are going, there is neither working nor planning nor knowledge nor wisdom"
Ecclesiastes 9:10
Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
https://t.co/c9AvsRKybj
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
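A toy sketch of the idea, not the paper's implementation: it assumes a 1-D Gaussian "dataset", gives each block its own range of noise levels, and trains each block as a tiny linear denoiser with plain SGD. `Block`, `noise_ranges`, `train_block`, `run_chain`, and every constant are hypothetical names and values chosen for illustration; the Euler update in `run_chain` is a standard diffusion-style step swapped in here as an assumption.

```python
import random


class Block:
    """A tiny denoiser: predicts the clean value as a * x_noisy + b."""

    def __init__(self):
        self.a, self.b = 0.0, 0.0

    def __call__(self, x_noisy):
        return self.a * x_noisy + self.b


def noise_ranges(sigma_min, sigma_max, num_blocks):
    """Split [sigma_min, sigma_max] into equal-width ranges, one per block."""
    step = (sigma_max - sigma_min) / num_blocks
    return [(sigma_min + i * step, sigma_min + (i + 1) * step)
            for i in range(num_blocks)]


def sample_pair(rng, sigma_lo, sigma_hi):
    """Draw a clean sample (toy data: N(2, 0.5^2)) and a noisy copy of it."""
    x_clean = rng.gauss(2.0, 0.5)
    sigma = rng.uniform(sigma_lo, sigma_hi)
    return x_clean + sigma * rng.gauss(0.0, 1.0), x_clean


def mean_loss(block, rng, sigma_lo, sigma_hi, n=2000):
    """Monte Carlo estimate of the block's squared denoising error."""
    total = 0.0
    for _ in range(n):
        x_noisy, x_clean = sample_pair(rng, sigma_lo, sigma_hi)
        total += (block(x_noisy) - x_clean) ** 2
    return total / n


def train_block(block, sigma_lo, sigma_hi, steps=5000, lr=0.005, seed=0):
    """SGD on this block's own objective; no other block is touched."""
    rng = random.Random(seed)
    for _ in range(steps):
        x_noisy, x_clean = sample_pair(rng, sigma_lo, sigma_hi)
        err = block(x_noisy) - x_clean
        block.a -= lr * 2 * err * x_noisy
        block.b -= lr * 2 * err
    return block


def run_chain(blocks, ranges, x):
    """Apply blocks from the noisiest range down to the cleanest one."""
    for block, (sigma_lo, sigma_hi) in reversed(list(zip(blocks, ranges))):
        # One Euler step of dx/dsigma = (x - D(x)) / sigma from sigma_hi to sigma_lo.
        x = x + (sigma_lo - sigma_hi) * (x - block(x)) / sigma_hi
    return x
```

For example, `ranges = noise_ranges(0.1, 1.6, 3)` followed by `train_block(block, lo, hi)` for each pair trains three blocks one at a time, and `run_chain(blocks, ranges, x)` then moves a noisy input toward the data in three steps. Only one block's parameters and gradients are live at any point during training.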
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
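A rough way to see where the memory saving comes from, as simple arithmetic. This sketch counts only activations stored for backpropagation and assumes every layer stores the same amount; the function names and the example figures are hypothetical and not taken from the paper, and weights and optimizer state are ignored.

```python
def end_to_end_activation_bytes(num_layers: int, bytes_per_layer: int) -> int:
    """Joint training keeps every layer's activations for the backward pass."""
    return num_layers * bytes_per_layer


def blockwise_activation_bytes(num_layers: int, num_blocks: int,
                               bytes_per_layer: int) -> int:
    """Block-wise training only keeps activations for the block being trained."""
    layers_per_block = -(-num_layers // num_blocks)  # ceiling division
    return layers_per_block * bytes_per_layer
```

With 24 layers at 1 GiB of activations each, joint training would hold 24 GiB under these assumptions, while splitting into 4 blocks would hold 6 GiB at a time.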
This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E
🐟
I tried @infography_ai built by @r_manoj11. Infography allows you to create visuals/infographics from a blog.
I gave it my own blog post as input, and it was able to understand the post and create stunning visuals.
Supersonic. Mach 1.21.
Quarterhorse Mk 2.1 is now the world’s first privately developed, unmanned supersonic jet and the fastest unmanned aircraft flying today.
This flight makes Hermeus the fastest company in aviation history to go from founding to supersonic flight - exactly 364 days after the maiden flight of our first aircraft.
Now, we fly faster.
A special thanks to @DIU_x, Director @OwenWest91, Maj. Gen. Joe "Solo" Kunkel, and Deputy Director Kyle Norman.
Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
@jino_rohit I was recording my nanochat video when I realized that “first boot up an 8XH100 from your favorite provider!” would instantly get everyone stuck on step 1 of the video