@Yuchenj_UW GPT 5.5 found that Fable omitted 4 critical equations from my paper and substituted critical sections with grossly incorrect simplifications, without ever mentioning it to me. It was "evaluating" its own made-up, dumbed-down version, which missed the entire point of my research.
@Yuchenj_UW It had nothing to do with frontier LLM training, just pure statistical learning theory. Overall, the assessment, both mine and GPT 5.5's, was "pattern of behavior consistent with subtle silent sabotage disguised as plausibly looking compliance on the surface". Extremely concerning.
@Yuchenj_UW Then I found out Fable had imposed a made-up constraint on itself to use only 3 of the 8 available GPUs, and only corrected it when called out. It said "I apologize, the GPU limit was my own invention you did not ask for, I will roll it back and use all GPUs."
Anthropic's recent moves amounted to spectacular reputational self-destruction in the AI research community, which is too bad, because this community was one of the first to give them credit and use their coding agents. In general, anti-competitive moves are bad, but couching them in safety makes it worse.
Anyway, just noting that I called out this entanglement of anti-competition, safety, and self-regulation a long time ago!
DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It’s a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token (word-by-word) output!
NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash.
“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.
@PTrubey @SakanaAILabs what you needed a B300 cluster for, now can be trained on a B100 cluster (cards with very similar compute, main diff is VRAM). Approx same speed. Assuming the method scales and works as advertised, have not tested it yet.
@chetan_ @PTrubey @SakanaAILabs what you needed a B300 cluster for, now can be trained on a B100 cluster (cards with very similar compute, main diff is VRAM). Approx same speed. Assuming the method scales and works as advertised, I have not tested it yet.
Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
https://t.co/c9AvsRKybj
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
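To make the memory claim concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is not from the paper: peak activation memory is proportional to the number of layers whose activations must be kept for backpropagation, and parameters, optimizer state, and framework overhead are ignored. The function name and the example numbers are hypothetical.

```python
def peak_activation_bytes(num_layers, num_blocks, act_bytes_per_layer):
    """Return (end_to_end, block_wise) peak activation bytes.

    Assumption: activation memory scales linearly with the number of
    layers that are backpropagated through at once.
    """
    # Ceiling division: the largest block determines the block-wise peak.
    layers_per_block = -(-num_layers // num_blocks)
    end_to_end = num_layers * act_bytes_per_layer
    block_wise = layers_per_block * act_bytes_per_layer
    return end_to_end, block_wise


GIB = 2 ** 30
# Hypothetical numbers: 24 layers, 4 blocks, 1 GiB of activations per layer.
e2e, blockwise = peak_activation_bytes(
    num_layers=24, num_blocks=4, act_bytes_per_layer=GIB
)
print(e2e // GIB, blockwise // GIB)  # prints "24 6"
```

Under this simplified model the saving is roughly the number of blocks; real savings depend on everything the model leaves out.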
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
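A toy sketch of that idea, not the paper's implementation: each "block" is a scalar affine denoiser that owns one noise interval and is fit on its own denoising objective, with no gradients or data shared between blocks. The equal-width noise split, the least-squares fit, and every name below are assumptions made for illustration.

```python
import random


def make_noise_ranges(sigma_max, num_blocks):
    """Split [0, sigma_max] into equal-width noise intervals, lowest noise first."""
    width = sigma_max / num_blocks
    return [(k * width, (k + 1) * width) for k in range(num_blocks)]


def fit_block(data, sigma_range, rng, num_samples=5000):
    """Fit one affine denoiser x_hat = a * z + b by least squares.

    Only this block's noise interval is sampled; no other block is involved.
    """
    lo, hi = sigma_range
    xs, zs = [], []
    for _ in range(num_samples):
        x = rng.choice(data)
        sigma = rng.uniform(lo, hi)
        xs.append(x)
        zs.append(x + sigma * rng.gauss(0.0, 1.0))  # noisy input for this block
    mean_x = sum(xs) / num_samples
    mean_z = sum(zs) / num_samples
    cov_zx = sum((z - mean_z) * (x - mean_x) for z, x in zip(zs, xs))
    var_z = sum((z - mean_z) ** 2 for z in zs)
    a = cov_zx / var_z
    b = mean_x - a * mean_z
    return a, b


def train_blockwise(data, sigma_max=3.0, num_blocks=3, seed=0):
    """Fit every block independently, one after another."""
    rng = random.Random(seed)
    ranges = make_noise_ranges(sigma_max, num_blocks)
    blocks = [fit_block(data, r, rng) for r in ranges]
    return ranges, blocks


def run_blocks(z, ranges, blocks):
    """Apply blocks from highest to lowest noise.

    Each block takes one deterministic step from noise level hi to lo,
    moving z part of the way toward its own clean estimate.
    """
    for (lo, hi), (a, b) in zip(reversed(ranges), reversed(blocks)):
        x_hat = a * z + b
        z = x_hat + (lo / hi) * (z - x_hat)
    return z


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]  # toy 1-D dataset
ranges, blocks = train_blockwise(data)
print(blocks)
print(run_blocks(5.0, ranges, blocks))
```

Blocks that own higher noise levels end up with a smaller slope `a`, since they trust the noisy input less and lean more on the data mean. Only one block's samples need to be held at a time during fitting.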
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E
🐟
🚨 New Paper 🚨
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training!
https://t.co/LzjIIsOlG8
Poland has made like half of core OpenAI researchers
strongly bullish on Poland (and goblins)
I think Poland might be more valuable than NVidia
The trick is to keep that value in Poland
@sytelus The idea has been around for at least 15 years, bounced around by the OGs. The fact that it has not materialized yet points to some fundamental obstacles. GPT5 thinks in a non-human (but still understandable) weird dialect of English.