For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall.
We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal.
This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
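A toy sketch of the general idea, not the paper's algorithm: each block gets its own denoising loss and is trained with no gradients crossing block boundaries. Everything here is an assumption made for illustration: the task (y = 2x + 1), the three-parameter affine "blocks", the noise schedule `SIGMAS`, and the names `train_block` and `run_network`.

```python
import random

def make_example(rng):
    """Hypothetical toy task: the clean target is y = 2x + 1."""
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + 1.0

def train_block(sigma, steps=20000, lr=0.02, seed=0):
    """Train ONE block in isolation as a denoiser.

    The block sees the input x and a noised target z = y + sigma * eps,
    and regresses the clean target y. No other block is touched.
    """
    rng = random.Random(seed)
    a = b = c = 0.0  # block parameters: prediction = a*x + b*z + c
    for _ in range(steps):
        x, y = make_example(rng)
        z = y + sigma * rng.gauss(0.0, 1.0)
        err = (a * x + b * z + c) - y
        # Plain SGD on 0.5 * err**2; the gradient stays local to this block.
        a -= lr * err * x
        b -= lr * err * z
        c -= lr * err
    return a, b, c

def run_network(blocks, sigmas, x, rng):
    """Inference: start from noise and let each block denoise in turn."""
    z = sigmas[0] * rng.gauss(0.0, 1.0)
    for a, b, c in blocks:
        z = a * x + b * z + c
    return z

# Assumed noise schedule, one level per block (high noise first).
SIGMAS = [1.0, 0.5, 0.25]
# Blocks are trained one at a time; each call only holds its own parameters.
BLOCKS = [train_block(s, seed=i) for i, s in enumerate(SIGMAS)]
```

Calling `run_network(BLOCKS, SIGMAS, 0.5, random.Random(1))` chains the separately trained blocks at inference time. In a real deep network the point would be that only the active block's activations need to be stored during training, rather than the whole stack.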
Bookmarked something fire on X… then spent 15 minutes scrolling trying to find it again? 😩
We got you fam.
Meet BookmarkBro — a beautiful native Mac app for browsing, super-fast search, tagging, and chatting with AI about your X bookmarks.
All locally on your Mac. No privacy leaks. No new sign-ups. No cloud nonsense.
Free in beta.
More + download in the replies 👇
@DylanWeaver @HinataMotivates Dwarkesh is like a 1B-param model doing 1000 tok/s, stumbling around in a reasoning loop. Jensen is a 2T-param Yoda one-shotting it.
Great interview. Only one Codex model runs on Cerebras afaik - 5.3-spark. I’ve been testing it - very fast but the quality isn’t great. Tiny context window and not as good overall as 5.4. I think this is because the chip only has 44 GB of SRAM. @MatXComputing will have an interesting blend of SRAM (weights) and HBM (KV cache), and Nvidia will no doubt do more with Groq over time for fast inference of some workloads.
Huang is right that Nvidia GPUs/CUDA are more general and more future-proof against architecture changes than TPUs optimised for current workloads/architectures. He also said that the main reason Anthropic is using TPUs is that Google/Amazon are large investors in them and Nvidia wasn’t able to invest early on - not sure how true that is, but it was interesting.
China doesn’t have access to the latest lithography for competitive power efficiency but will build EUV (or whatever comes after) capabilities eventually, likely in the next decade. They are moving pretty fast elsewhere (models obviously, but also fast 3D DDR5 from CXMT, Huawei etc. for processors). I think the chip ban is probably bad long term; it might have been better to keep them on Nvidia instead of accelerating homegrown alternatives.
@claudeai this is the way. the executor could be a local model also, or a realtime voice model that does tool calling for complex tasks when needed but doesn't stop the voice conversation
@amix3k @soumo_dg it's a really great model and the optional reasoning and tool calling are great too. but i wonder how this will scale to every user on popular apps unless metered. some day a model like this will run on device
@elonmusk v cool. will AI4/5 be sold separately? And will you be able to use the AI4/5 chip in your car for other inference tasks (like Digital Optimus) while not driving?
This was a great recent interview - https://t.co/LsnwGKdlg2
Good fit for millions of low complexity problems that are still unsolved and are verifiable
Coding is a bit special in the sense that there is potential for RSI - starting to see that with Karpathy’s new autoresearcher, AI-optimised CUDA kernels, etc.
@tomjohndesign This plugin has worked very well for the same task, but it’s great to see Figma embrace Claude Code more. More excited to see design systems integration and two-way flows in the future.
https://t.co/ILvXRbeWXc