[CV] SODA: Bottleneck Diffusion Models for Representation Learning
D A. Hudson, D Zoran, M Malinowski, A K. Lampinen, A Jaegle, J L. McClelland, L Matthey, F Hill, A Lerchner [Google DeepMind] (2023)
https://t.co/OHaC0KObCY
- The paper introduces SODA, a self-supervised diffusion model for both representation learning and image generation.
- SODA consists of an image encoder that distills an input image into a compact latent code, and a conditional denoising diffusion decoder that uses the latent code to guide the image generation process.
- A tight bottleneck between the encoder and decoder encourages the emergence of disentangled and semantically meaningful latent representations.
- SODA is trained with a novel view synthesis objective, where the encoder encodes a source image, and the decoder uses that code to generate a novel, related target image. This acts as a powerful pretext task for self-supervised representation learning.
- SODA incorporates several innovations including layer modulation, modified classifier-free guidance, and an inverted noise schedule to further improve the latent representations.
- Experiments demonstrate SODA's strong performance on downstream tasks like ImageNet classification, its ability to generate high fidelity images and novel views, and the disentangled nature of its latent space.
- The compact bottlenecked design and novel view training objective set SODA apart from prior diffusion models and establish its capabilities for both representation learning and controllable image synthesis.
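The encoder, bottleneck, and conditional denoiser described in the bullets above can be sketched roughly as follows. This is a toy NumPy illustration, not the paper's implementation: the single-layer networks, the sizes (8x8 images, a 4-dimensional latent, 16 hidden units), the reading of "layer modulation" as a feature-wise scale and shift, and all function names are assumptions made for this example. The noising step is the standard DDPM forward process, and SODA's modified guidance and inverted noise schedule are not modeled.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
IMAGE_DIM, LATENT_DIM, HIDDEN_DIM = 64, 4, 16


def make_params(rng):
    """Randomly initialised weights for the toy encoder and denoiser."""
    def init(rows, cols):
        return 0.1 * rng.standard_normal((rows, cols))
    return {
        "w_enc": init(IMAGE_DIM, LATENT_DIM),    # encoder: image -> latent
        "w_in": init(IMAGE_DIM, HIDDEN_DIM),     # denoiser input layer
        "w_scale": init(LATENT_DIM, HIDDEN_DIM),  # latent -> per-feature scale
        "w_shift": init(LATENT_DIM, HIDDEN_DIM),  # latent -> per-feature shift
        "w_out": init(HIDDEN_DIM, IMAGE_DIM),    # denoiser output layer
    }


def encode(image, w_enc):
    """Bottleneck: compress the whole source image into a small latent code."""
    return np.tanh(image.reshape(-1) @ w_enc)


def modulate(features, latent, w_scale, w_shift):
    """Assumed form of layer modulation: the latent scales and shifts features."""
    return features * (1.0 + latent @ w_scale) + latent @ w_shift


def add_noise(target, noise, alpha_bar):
    """Standard DDPM forward process at cumulative signal level alpha_bar."""
    return np.sqrt(alpha_bar) * target + np.sqrt(1.0 - alpha_bar) * noise


def denoise(noisy, latent, params):
    """Predict the noise in `noisy`, guided only by the latent code."""
    hidden = np.tanh(noisy.reshape(-1) @ params["w_in"])
    hidden = modulate(hidden, latent, params["w_scale"], params["w_shift"])
    return (hidden @ params["w_out"]).reshape(noisy.shape)


def training_loss(source, target, alpha_bar, params, rng):
    """Novel-view objective: encode the source view, denoise the target view."""
    latent = encode(source, params["w_enc"])
    noise = rng.standard_normal(target.shape)
    noisy_target = add_noise(target, noise, alpha_bar)
    predicted_noise = denoise(noisy_target, latent, params)
    return float(np.mean((predicted_noise - noise) ** 2))
```

Because the denoiser sees the source image only through the 4-dimensional latent, any information it needs about the target view has to pass through that bottleneck, which is the property the summary credits for the structured latent space.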
But seriously folks, this is a short and juicy tirade in which I say:
(0) there will be superhuman AI in the future
(1) they will be under our control
(2) they will not dominate us nor kill us
(3) they will mediate all of our interactions with the digital world
(4) hence, they will need to be open platforms so that everyone can contribute to training and tuning them.
@soumikkanad @arankomatsuzaki (And lastly, as an unofficial side note, while of course the date of the publication is totally what counts, we actually got the ImageNet score at the end of Feb and the paper publication got delayed a lot because of my PhD graduation/thesis writing.)
@soumikkanad @arankomatsuzaki In addition to that, I wasn't aware of the diffusion-beats-gans paper while writing, I'll be most happy to add a discussion of it to the paper!
@soumikkanad @arankomatsuzaki Finally, considering the number of parameters is critical for valid comparison. While in SODA we make sure to use a model of size comparable to competing methods, the first paper you mention uses 5x more parameters (!) (couldn't find model size details for the second paper).
(3/3)
@soumikkanad @arankomatsuzaki In addition, I believe a key result is that for light data augmentation, our model beats all models we compared to, including both the leading generative and discriminative approaches, such as MAE, DINO, BYOL etc!
(2/3)
@soumikkanad @arankomatsuzaki Hi @soumikkanad, thank you for these references! Note that these works achieve 61.95-63.9% in linear probing, significantly lower than both our SODA and contrastive methods (>72%)
(1/3)
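For readers unfamiliar with the metric in this thread: linear probing freezes a pretrained encoder and trains only a linear classifier on its output features. Below is a minimal sketch, assuming the frozen features are already available as a NumPy array. It uses a closed-form ridge regression onto one-hot labels instead of the logistic-regression probe that is more commonly trained, and the function names are hypothetical.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the probe can learn a bias term."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear map from frozen features to one-hot labels (ridge regression)."""
    x = _with_bias(features)
    y = np.eye(num_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def probe_accuracy(weights, features, labels):
    """Top-1 accuracy of the linear probe on the given features."""
    scores = _with_bias(features) @ weights
    return float(np.mean(np.argmax(scores, axis=1) == labels))
```

Only `weights` is learned here; the encoder that produced `features` is never updated, so the score reflects how linearly separable the pretrained representation already is.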
SODA: Bottleneck Diffusion Models for Representation Learning
The first diffusion model to succeed at ImageNet linear-probe classification
proj: https://t.co/FEG1zn873S
abs: https://t.co/mg2wqJwyhY
LLMs obviously have *some* understanding of what they read and generate.
But this understanding is very limited and superficial. Otherwise, they wouldn't confabulate so much and wouldn't make mistakes that are contrary to common sense.
I have argued, since at least 2016, that AI systems need to have internal models of the world that would allow them to predict the consequences of their actions, and thereby allow them to reason and plan.
Current Auto-Regressive LLMs do not have this ability, nor anything close to it, and hence are nowhere near reaching human-level intelligence.
In fact, their complete lack of understanding of the physical world and lack of planning abilities puts them way below cat-level intelligence, never mind human-level.
AR-LLMs can accumulate large amounts of textual knowledge (if only approximately) and can retrieve it with appropriate context (if only approximately). More than a cat, certainly.
But how is it that any 10-year-old can learn to clear up the dinner table and fill up the dishwasher in one shot, whereas we are nowhere near having robots capable of learning this in any amount of time?
Obviously, we are still missing something really big to reach human-level AI.
I have written where I think AI research should go over the next decade or two to bridge that gap:
https://t.co/yqWEubV9id
All my talks of the last couple of years have been on "objective driven AI architectures" which are an attempt to bridge that gap while making AI systems controllable, safe, and subservient to humanity. E.g. this one:
https://t.co/2QTDpXWjzy
Today with @YouTube, we're announcing Lyria: our most advanced music generation model to date.
We're also releasing 2 AI experiments in close collaboration with participating artists and creators to bring their ideas to life responsibly.
https://t.co/i9ve66A5rv
Last night, 50 years to the day after the pioneering Intergalactic SpaceWar Olympics first video game contest (https://t.co/urYsg77H7b), current and former members of @StanfordAILab gathered for the 2022 SAIL Gaming Tournament. Everyone had fun, with games old and new.
I always tell my students: if you only read papers published in the past five years, the probability that you will have any ground-breaking idea in your lifetime is nearly zero. The odds are probably lower than winning a jackpot on a slot machine in Vegas...