GANs will be solved when continual learning is solved. Discriminator (D) learning is a continual learning problem. You start by allocating params to broad features, and as G improves you have to move to finer features. But you can't forget the old broad features, because G can regress to fool you. D solves different problems at different parts of training.
Fixing the loss landscape is one thing but it might need arch changes. Probably a growing MoE type continually learning D.
classic GAN loss == binary cross entropy (hinge is the margin-based variant; both rate each sample pointwise)
rpGAN/relativistic GAN loss == Bradley-Terry
mode collapse from the old GAN literature is primarily an artifact of the fact that pointwise scalars aren't grounded in the relative gap between the real and fake samples
ranking > rating
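A minimal sketch of the rating vs ranking distinction for the discriminator loss, assuming logit outputs and one real/fake pair per row. The function names and toy logits are made up for illustration. Shifting every logit by a constant changes the pointwise BCE loss but leaves the Bradley-Terry loss untouched, because only the real-vs-fake gap enters it.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def bce_d_loss(real_logits, fake_logits):
    # Pointwise "rating": each sample is scored on its own against labels 1/0.
    return softplus(-real_logits).mean() + softplus(fake_logits).mean()

def bradley_terry_d_loss(real_logits, fake_logits):
    # Pairwise "ranking": -log sigmoid(D(real) - D(fake)), only the gap matters.
    return softplus(-(real_logits - fake_logits)).mean()

# Hypothetical discriminator logits for three real/fake pairs.
real = np.array([2.0, 1.5, 0.5])
fake = np.array([-1.0, 0.0, 0.3])
shift = 3.0  # add the same constant to every logit

bce_base = bce_d_loss(real, fake)
bce_shifted = bce_d_loss(real + shift, fake + shift)
bt_base = bradley_terry_d_loss(real, fake)
bt_shifted = bradley_terry_d_loss(real + shift, fake + shift)

print("BCE changes under shift:", bce_base, "->", bce_shifted)
print("Bradley-Terry is shift-invariant:", bt_base, "->", bt_shifted)
```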
Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
https://t.co/c9AvsRKybj
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
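A toy sketch of the idea, not the paper's implementation. It assumes each block is a denoiser responsible for its own range of noise levels, and it uses linear blocks fit by least squares on a synthetic linear target so it stays short. The noise schedule, `fit_block`, and `predict` are all hypothetical names chosen for illustration. The point is only that each block is fit on its own objective, with no gradients flowing between blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy, num_blocks = 512, 4, 3, 4

# Synthetic task (assumption): the target is a linear function of the input.
A = rng.normal(size=(dx, dy))
x = rng.normal(size=(n, dx))
y = x @ A

# Assumed noise schedule: block b handles levels between edges[b] and edges[b + 1].
edges = np.geomspace(10.0, 0.01, num_blocks + 1)

def fit_block(sigma_hi, sigma_lo):
    # Train one block in isolation: denoise targets corrupted at its noise levels.
    sigma = rng.uniform(sigma_lo, sigma_hi, size=(n, 1))
    z = y + sigma * rng.normal(size=y.shape)
    features = np.concatenate([x, z], axis=1)
    W, *_ = np.linalg.lstsq(features, y, rcond=None)
    return W

# Blocks are fit one at a time; only one block's weights are being solved for at once.
blocks = [fit_block(edges[b], edges[b + 1]) for b in range(num_blocks)]

def predict(x_new):
    # Start from pure noise and let each block move the state closer to the target.
    z = edges[0] * rng.normal(size=(x_new.shape[0], dy))
    for b, W in enumerate(blocks):
        y_hat = np.concatenate([x_new, z], axis=1) @ W
        # Deterministic step down to the next (lower) noise level.
        z = y_hat + (edges[b + 1] / edges[b]) * (z - y_hat)
    return y_hat

x_test = rng.normal(size=(64, dx))
err = np.abs(predict(x_test) - x_test @ A).max()
print("max abs error on held-out inputs:", err)
```

The task here is trivially solvable by a single block, so it shows the training structure only and says nothing about memory savings or accuracy on real models.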
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E
🐟
Depending on how you implement your sampling, diffusion can have far worse TTFT (time to first token) for sequential outputs like creative writing/audio/video. AR is inherently streamable.
In agentic coding this does not really matter, as you want to minimize the total time to generate the whole response.
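A back-of-the-envelope latency model for the tradeoff above. It assumes the diffusion sampler denoises the whole sequence jointly and only emits output after the last step, which is one possible implementation and not the only one. The step counts and per-step times are made-up numbers, not measurements.

```python
def ar_latency(num_tokens, step_time):
    # Autoregressive: the first token is available after one step,
    # and the full response takes one step per token.
    ttft = step_time
    total = num_tokens * step_time
    return ttft, total

def diffusion_latency(num_denoise_steps, step_time):
    # Assumed joint denoising: nothing is final until the last step,
    # so time to first token equals total time.
    total = num_denoise_steps * step_time
    return total, total

# Hypothetical numbers for a 1000-token response.
ar_ttft, ar_total = ar_latency(num_tokens=1000, step_time=0.02)
diff_ttft, diff_total = diffusion_latency(num_denoise_steps=32, step_time=0.1)

print(f"AR:        TTFT={ar_ttft:.2f}s total={ar_total:.2f}s")
print(f"Diffusion: TTFT={diff_ttft:.2f}s total={diff_total:.2f}s")
```

Under these assumed numbers AR wins on TTFT and diffusion wins on total time, which is the streaming vs agentic-coding split described above.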
@MostlyMonkey @menhguin @DeepDishEnjoyer Out of college at 22 I was making 7x my monthly spend post tax, not including investment income. This was at a tech startup at peak ZIRP, but such salaries are sustainable in quant. 5 years of this would buy 30 years of retirement. Add investment income and 35 yrs easy.
Sorry for the late release.
NITP (Next Implicit Token Prediction) will be released soon, expected next Monday. 😀
A new LLM pre-training paradigm that goes beyond next-token prediction by learning the next token’s implicit representation, fixing representation degeneration.