I think of attention as a projection of the natural language zipf-ian spectrum (source side) into the loss landscape of the model architecture (capacity side). Thought of in this way, literally only quadratic attention is capable of achieving perfect projection. Other attention mechanisms are predictably imperfect. And if you agree with the Platonic Representation Hypothesis, with some whitening, different activation geometries are just representational power over the same gauge orbit of the platonic representation!
Using @zeddotdev a bit recently whilst making my own GPUI project. It has given me a lot of respect for the zed team. Rust only UIs are no joke, and GPUI (whilst bloated by the dependent crates) is a hell of a thing to engineer. Also a new found respect for browsers and what web devs take for granted (like scrollbars, views, drag and drop, text select and copy and paste etc..)!
For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall.
We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal.
This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
Counterintuitively, I've seen benchmarks that show LLMs perform worse when they use the web on certain tasks. My hypothesis is that zero tool use means you stay closer to base train distribution and somehow access higher model capacity. No idea if others see the same, but easy to test if someone has the tokens to spend.
@willccbb Extremely bullish on this. I think of it as finding the highest leverage problem for the tokens you have access to. My current flavour is the Platonic Representation Hypothesis. If all models land at the same representation, why isn't there a short cut to getting there?
Neural networks might speak English, but they think in shapes.
Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision.
Starting today, we’re releasing a series of posts on this research agenda. 🧵
@AjdDavison I'm finding that if the prior art exists in the model somewhere, you can connect previously disjoint findings and experiment rapidly. Along with very fast data analysis this is a great research tool. But truly novel findings, no. The models still struggle out of distribution
Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us.
https://t.co/zUuvPJGmcq
Love this exploration, and the passion! I'm also convinced there are more fundamental relationships between the data we use, and the models we empirically grow to represent that data distribution. Universal representations are the canary!
1/ Deep learning is going to have a scientific theory. We can see the pieces starting to come together, and it's looking a lot like physics!
We're releasing a paper pulling together these emerging threads and giving them a name: learning mechanics.
🔨 https://t.co/92nSIHameW 🔧
I'm convinced that in the future of agentic coding, Rust will be a clear winner.
Sure the models were built with Python. But agents LOVE type-safe, compile-time-checked, opinionated languages like Rust; its compiler feedback is incredibly clear and helpful. It's like getting strongly verified rewards every time you change a line of code.
The UK is a great country with an extraordinary history. Our stagnation is real, but it's fixable and worth fixing.
Enjoyed giving this talk at @lfg_uk last week and so encouraged by the optimistic responses I've had from people who are building a brilliant future for Britain 🚀
New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show they can, by generalization from training in simpler settings.
Read our blog post here: https://t.co/KhEFIHf7WZ
@MKBHD Imagine a dystopian future where artificial intelligences straight-up lie to your face.
Oh wait, that’s not the future… it’s today. 😥 https://t.co/pT5411BCMZ
Hear this: text-to-image models can only train because of the abundance of alt text throughout the web. The conclusion is unavoidable: If AGI kills us all, it's blind people's fault.
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵
Website: https://t.co/ja2FPw9aad
Paper: https://t.co/1q4fzjJSyZ