@GaryMarcus I don't fully understand why the assumption was that more experimentation due to AI would translate linearly into more successes. The high cost of programming served as a prioritisation gate and generally filtered out most of the bad ideas.
Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
https://t.co/c9AvsRKybj
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
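A toy sketch of the idea above, not the paper's code. It assumes each block's role can be modeled as denoising at its own noise level, and it uses linear least-squares blocks, synthetic data, and a hand-picked three-level noise schedule, all of which are illustrative assumptions. The point is only that each block is fitted on its own objective, without gradients flowing through the others.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: inputs x, targets y = x @ A plus a little noise.
n, d_in, d_out = 2000, 4, 2
A = rng.normal(size=(d_in, d_out))
x = rng.normal(size=(n, d_in))
y = x @ A + 0.1 * rng.normal(size=(n, d_out))

# Assumed schedule: one noise level per block, decreasing toward the target.
sigmas = [2.0, 1.0, 0.1]

def train_block(sigma):
    """Fit one linear block to recover y from (x, noisy y) at its own noise level.

    Only this block's weights are involved; no other block is touched.
    """
    z = y + sigma * rng.normal(size=y.shape)
    features = np.hstack([x, z])
    W, *_ = np.linalg.lstsq(features, y, rcond=None)
    return W

# Blocks are trained one at a time and independently of each other.
blocks = [train_block(s) for s in sigmas]

def run_blocks(x, z):
    """Inference: pass the representation through the blocks in order."""
    for W in blocks:
        z = np.hstack([x, z]) @ W
    return z

# Start from pure noise and let the blocks move it toward the target.
z0 = sigmas[0] * rng.normal(size=y.shape)
initial_mse = float(np.mean((z0 - y) ** 2))
final_mse = float(np.mean((run_blocks(x, z0) - y) ** 2))
print(f"MSE before blocks: {initial_mse:.3f}, after blocks: {final_mse:.3f}")
```

In a real network each block would be a stack of layers trained by gradient descent, but the memory argument is the same: only the block currently being fitted needs its activations and gradients in memory.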
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, those multiple iterations can be replaced with a single forward pass during training.
Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E
🐟
@calvinfroedge It's enabled folks with no knowledge of a subject to have an opinion on it with no work required. It speaks to human laziness, and that makes it dangerous.
I've argued that human-LLM pairing isn't going to produce new breakthroughs systematically, and forcing myself to put pen to paper gave me a few ideas why. Such novelty is likely going to need a machine-led approach. Human-LLM pairing won't work. https://t.co/AdOzqabvIs
@ID_AA_Carmack Everything I've seen uses (contextual) bandits or similar constructs without credit assignment, because a) inference-time performance is paramount, and b) offline policy evaluation using data generated by any policy is well understood (rejection sampling, SN-IPS or doubly robust estimators).
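For readers unfamiliar with the estimators mentioned, here is a minimal sketch of inverse propensity scoring (IPS) and its self-normalised variant (SN-IPS). The function names and the four-row log are made up for illustration; it assumes the logging policy's action probabilities were recorded and are non-zero.

```python
import numpy as np

def importance_weights(logging_probs, target_probs):
    """Ratio of target-policy to logging-policy probability for each logged action."""
    logging_probs = np.asarray(logging_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    return target_probs / logging_probs

def ips(rewards, logging_probs, target_probs):
    """Plain IPS: mean of importance-weighted rewards (unbiased, can be high variance)."""
    w = importance_weights(logging_probs, target_probs)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))

def snips(rewards, logging_probs, target_probs):
    """Self-normalised IPS: divide by the sum of weights instead of the sample count."""
    w = importance_weights(logging_probs, target_probs)
    return float(np.sum(w * np.asarray(rewards, dtype=float)) / np.sum(w))

# Hypothetical logged interactions: reward, P(action | logging policy), P(action | target policy).
rewards = [1, 0, 1, 1]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.25, 0.5, 0.5]

ips_value = ips(rewards, logging_probs, target_probs)      # 1.125
snips_value = snips(rewards, logging_probs, target_probs)  # 0.9
print(ips_value, snips_value)
```

Note that plain IPS lands above the maximum possible reward of 1 on this tiny log, while the self-normalised estimate stays within the reward range, which is one reason SN-IPS is often preferred in practice.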
@timneutkens @rauchg @vercel @v0 @nextjs Tried it. Out of the box it was next to no help, so I'm checking traces using your tips to understand what I'm doing wrong. https://t.co/UotleRvVyO
@bernhardsson @modal_labs I seem to remember this being done as a MIP in Kubernetes for determining what server to run a container on to satisfy CPU/mem/availability requirements. Hard-constraint optimisation problems pop up in very interesting places!
Props once again to @analogue for making a beautiful product in the Pocket. Very refreshing in the age of shoddy build quality and endless digital subscriptions.
Shockingly, Elon Musk's strategy of berating and coercing private companies into handing over advertising revenues, rather than making Twitter attractive to them, appears to be backfiring in remarkable new ways, according to the Financial Times:
https://t.co/a25mAFDmDj
Reading up on the places in London that black cabs are restricted from accessing. Ridiculous. Every time I visit, I take a black cab: it's safe and reliable, and the drivers have spent more time studying than I did for my master's degree. They should be allowed on every road and junction.
@shubham12et1062 @bernhardsson I'd probably use a library since you get other things for free like offline evaluation and N different algorithms to play around with.