For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But backprop has to hold the whole network's activations in memory at once, and that is a big reason AI training is hitting a resource wall.
We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal.
This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.
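To make the idea concrete, here is a toy sketch (not the paper's algorithm). It assumes each "block" is just a scalar denoiser responsible for one noise level, and it fits each block by closed-form least squares instead of gradient descent. The `sigmas` schedule and the `fit_block` helper are hypothetical names for illustration. The point it shows: every block trains from its own noisy samples and never needs another block's weights, activations, or gradients.

```python
import random

def fit_block(sigma, n_samples=20000, seed=0):
    """Fit one toy 'block': a scalar w such that w * noisy approximates clean."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        clean = rng.gauss(0.0, 1.0)                  # toy unit-variance data
        noisy = clean + sigma * rng.gauss(0.0, 1.0)  # corrupt at this block's noise level
        num += noisy * clean
        den += noisy * noisy
    return num / den                                 # closed-form least squares

# Hypothetical noise schedule: one noise level per block, high to low.
sigmas = [2.0, 1.0, 0.5]

# Each block is fitted in isolation, so only one block's state
# has to live in memory at any time.
blocks = [fit_block(s, seed=i) for i, s in enumerate(sigmas)]
```

For unit-variance data the ideal denoiser weight at noise level sigma is 1 / (1 + sigma^2), so each fitted weight should land close to that value.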
Harmonious Geometry: The Hirajoshi Wave.
Watch as these gravity-defying spheres trace the hauntingly beautiful paths of the C Hirajoshi scale.
Each ball is tuned to a specific frequency within this traditional Japanese pentatonic scale (C, D, Eb, G, Ab), creating a mesmerizing "Polyrhythmic Pendulum" effect. As the balls oscillate at slightly different speeds, they drift into chaotic patterns before perfectly realigning into a breathtaking visual and auditory climax.
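For the curious, a minimal sketch of the math behind an effect like this. It assumes equal temperament with A4 = 440 Hz, the octave starting at middle C, and a classic pendulum-wave setup where ball k completes `base_cycles + k` oscillations per cycle; the actual video's tuning and timing are not stated, so every number below is illustrative.

```python
A4_HZ = 440.0  # assumed concert-pitch reference

def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note number."""
    return A4_HZ * 2 ** ((note - 69) / 12)

# C Hirajoshi starting at middle C (MIDI 60); the register is an assumption.
hirajoshi = {"C": 60, "D": 62, "Eb": 63, "G": 67, "Ab": 68}
freqs = {name: round(midi_to_hz(n), 2) for name, n in hirajoshi.items()}

def phases(t, cycle_seconds=60.0, base_cycles=20, n_balls=5):
    """Fraction of an oscillation each ball has completed at time t."""
    return [((base_cycles + k) * t / cycle_seconds) % 1.0 for k in range(n_balls)]
```

Because every ball completes a whole number of oscillations in `cycle_seconds`, all phases return to zero at that moment, which is the realignment described above. In between, the phases spread apart.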
From the sharp, angular bounce to the fluid, sweeping curves of the rainbow trails, this is where physics meets fine art.
Credit: project.jdm
Google DeepMind just solved one of the dirtiest problems in image generation. and the fix is almost embarrassingly elegant 🤯
nearly every diffusion model you've used (Stable Diffusion, Flux, etc.) relies on latent representations. an encoder compresses images into a compact space, and a diffusion model learns to generate in that space.
the problem nobody talks about: how you train that encoder is basically vibes.
the original Stable Diffusion approach slaps a KL penalty on the encoder with a manually chosen weight. too much regularization and you lose high-frequency details. too little and the latent space becomes too chaotic for the diffusion model to learn from.
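here's roughly what that knob looks like, as a minimal sketch. it assumes a plain MSE reconstruction term and a diagonal Gaussian encoder; real autoencoders of this kind usually add other loss terms on top, and `autoencoder_loss` and `kl_weight` are illustrative names, not any library's API.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def autoencoder_loss(x, x_hat, mu, log_var, kl_weight):
    # kl_weight is the hand-picked number: larger values pull the latents
    # toward N(0, 1), smaller values favor faithful reconstruction.
    return mse(x, x_hat) + kl_weight * kl_to_standard_normal(mu, log_var)
```

nothing in this loss tells you what `kl_weight` should be. that's the guesswork.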
everyone just... picks a number and hopes for the best. it's the equivalent of tuning a radio by feel while blindfolded.
DeepMind's paper reframes the entire question.
instead of treating the encoder and diffusion model as separate stages, they train them together. the encoder's output noise gets directly linked to the diffusion prior's minimum noise level. this one connection turns the messy KL term into a simple weighted MSE loss, and gives you something you've never had before: a tight, interpretable upper bound on how much information your latents actually carry.
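one generic reason a KL term can collapse into a weighted MSE: when two Gaussians share the same noise level, the KL between them is just the squared distance between their means, scaled by that noise level. the sketch below checks that textbook identity for scalars. it is not the paper's derivation, which may differ in its details, and the function names are illustrative.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalars, in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def weighted_squared_error(mu_q, mu_p, sigma):
    """Squared mean difference, weighted by 1 / (2 * sigma^2)."""
    return (mu_q - mu_p) ** 2 / (2 * sigma ** 2)

def nats_to_bits(nats):
    """Convert an information quantity from nats to bits."""
    return nats / math.log(2)

# With matched noise levels the two quantities agree.
sigma = 0.1
kl = gaussian_kl(0.3, sigma, -0.2, sigma)
wse = weighted_squared_error(0.3, -0.2, sigma)
```

dividing the KL by ln 2 converts it to bits, the unit used when talking about how much information the latents carry.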
think of it like this. before, you were compressing an image and praying the compression ratio was "about right." now you have an actual dial that tells you exactly how many bits of information are flowing through, and you can set it precisely.
the results speak for themselves. FID of 1.4 on ImageNet-512 with high reconstruction quality, using fewer training FLOPs than models trained on Stable Diffusion latents. on Kinetics-600 video, they set a new state-of-the-art FVD of 1.3.
but the real contribution isn't the numbers. it's that they turned one of the most heuristic-heavy parts of the generative AI pipeline into something principled. the trade-off between "easy to learn" and "faithful reconstruction" was always there. this paper just made it visible and controllable.
the uncomfortable implication for everyone building on frozen Stable Diffusion encoders: you've been optimizing everything except the foundation.