Excited to share our new work: “Learning to See Before Seeing”! 🧠➡️👀 We investigate an interesting phenomeno: how do LLMs, trained only on text, learn about the visual world?
Project page: https://t.co/9mQt3qnckL
Introducing VGGT-Ω: scaling feed-forward reconstruction across static and dynamic scenes, and studying whether the learned geometric representations transfer beyond reconstruction.
@phillip_isola PRH fan here. Beyond measurement metrics, just sharing my feelings. When working with LLM + vision (understand & gen), we can feel the Platonic behind it. it’s less about one modality replacing others, but about high rep similarity making cross-modal learning so much easier.
@TongPetersb@sainingxie@ylecun@mengyer@YiMaTweets@LukeZettlemoyer@liuzhuang1234 Our friendship began during our PhD application cycle back to 2022, and there’s hardly any need to say more about how amazing your work has been. Behind all of it is an incredibly hardworking person, with strong belief and a very kind heart. Congrats, Dr. Tong!!!
AMI was founded by amazing researchers I have deeply respected since the very beginning of my research career. They have built fundemental things:
before LLM era (SSL, JEPA, moco, mae, barlow twins, arch from LeNet to more recently resnext, convnext...),
during LLM era (gpt, dalle, gemini, Cambrian, rae, beyond language modeling...),
and AMI will undoubtedly open the path of building advanced, world-centric intelligence that shape the future!
i’m joining forces with @ylecun and an incredible group of people to start AMI Labs @amilabs.
AMI isn’t a conventional lab. we don’t intend to become one.
a lot to say about why this moment matters, but for now we’re heads down building.
join us: https://t.co/zXj1IyBYDc
AMI was founded by amazing researchers I have deeply respected since the very beginning of my research career. They have built fundemental things:
before LLM era (SSL, JEPA, moco, mae, barlow twins, arch from LeNet to more recently resnext, convnext...),
during LLM era (gpt, dalle, gemini, Cambrian, Jepa, rae, beyond language modeling...),
and AMI will undoubtedly open the path of building advanced, world-centric intelligence that shape the future!
We believe the next leap in General Intelligence lies beyond language. Vision holds an untapped ocean of potential for true world modeling. By training all from scratch, we show that vision can play a more foundational role in intelligence, rather than just being an add-on!!
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision.
We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
Humans communicate through language and interact with the world through vision, yet most multimodal models are language-first. What happens when we go beyond language? 🤔
Beyond Language Modeling: a deep dive into the design space of truly native multimodal models
Paper: https://t.co/KOpmL1PItn
Project: https://t.co/Oy6XuEtUAi
@DavidJFan I know this is Truly a massive challenge that went far beyond only the technical aspects.
Thank you, David, for driving this forward and delivering such incredible work. Absolutely amazing!!🤩
It was really a high-stakes bet and a challenging journey. Huge congrats to the amazing team, especially Peter @TongPetersb , David @DavidJFan , and John @__JohnNguyen__ for their incredible work in leading the project! You are da best!!!
It covers many design spaces in unified pre-training, from visual rep and arch to world modeling and scaling. Each part has very useful findings backed up with tons of explorations. Crucially, we show that vision and language are highly complementary, a synergy for intelligence.
Excited to share our new work: “Learning to See Before Seeing”! 🧠➡️👀 We investigate an interesting phenomeno: how do LLMs, trained only on text, learn about the visual world?
Project page: https://t.co/9mQt3qnckL