during my time learning about contrastive/self-supervised learning, it always felt mysterious how exactly it works and what mechanisms it introduces.
I wrote these two blog posts to explain what I've learned over the past few years and to simplify the concepts.
links below
For the past few years, my research has focused on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme:
Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs!
1/
Meet Gemma 4 12B!
A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.
Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
Lloyd R. Welch (1974). Lower Bounds on the Maximum Cross Correlation of Signals. IEEE Transactions on Information Theory, 20(3), 397–399.
John J. Benedetto and Matthew Fickus (2003). Finite Normalized Tight Frames. Advances in Computational Mathematics, 18(2–4), 357–385.
Vardan Papyan, X. Y. Han, and David L. Donoho (2020). Prevalence of Neural Collapse During the Terminal Phase of Deep Learning Training. Proceedings of the National Academy of Sciences (PNAS), 117(40), 24652–24663.
Tongzhou Wang and Phillip Isola (2020). Understanding Contrastive Representation Learning Through Alignment and Uniformity on the Hypersphere. ICML 2020 (PMLR 119), 9929–9939. arXiv:2005.10242.
“make the latent space better” is the vaguest advice in ml.
but there’s a precise answer, and it predates deep learning by decades.
a good latent space is the solution to a sphere-packing problem.
the optimum has a name.
new post + a runnable jax companion 🧵
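a minimal jax sketch of the packing idea, not the companion code itself. it assumes the optimum meant here is the simplex equiangular tight frame from the neural collapse literature, checked against the Welch bound. the function names `simplex_etf`, `coherence`, and `welch_bound` are made up for illustration.

```python
import math

import jax.numpy as jnp


def simplex_etf(k):
    """Return k unit vectors (rows) in R^k with pairwise cosine -1/(k-1).

    Built by centering the standard basis and renormalizing each row.
    The rows span a (k-1)-dimensional subspace.
    """
    centered = jnp.eye(k) - jnp.ones((k, k)) / k
    return centered / jnp.linalg.norm(centered, axis=1, keepdims=True)


def coherence(x):
    """Largest absolute cosine between two distinct rows of x (rows assumed unit norm)."""
    gram = x @ x.T
    off_diagonal = gram * (1.0 - jnp.eye(x.shape[0]))  # zero out self-similarity
    return jnp.max(jnp.abs(off_diagonal))


def welch_bound(n, d):
    """Welch lower bound on the coherence of n unit vectors in d dimensions (n >= d, n > 1)."""
    return math.sqrt((n - d) / (d * (n - 1)))


# 5 points packed in a 4-dimensional subspace: compare achieved coherence to the bound.
x = simplex_etf(5)
print(float(coherence(x)), welch_bound(5, 4))
```

the two printed numbers should agree up to floating point error: no arrangement of 5 unit vectors in 4 dimensions has a smaller worst-case overlap than this one.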
@BangachevKiril what i'm exploring is whether image representations can be modeled as a fiber bundle, where text captures the semantic base space and image-specific details are encoded in the fibers
text would only need to approximate the shared semantic structure rather than the full image representation
@BangachevKiril yyyep,
what i understand so far is that fully closing the modality gap may remove modality-specific information, especially from images.