Despite impressive visuals, current video world models that only generate a single agent’s perspective aren’t modeling a complete world. The complex behaviors that arise from real or virtual worlds do not happen in a vacuum. They arise from interactions among many agents. Instead of modeling a one-agent island, let’s try modeling a multi-agent planet.
This led to our project, Solaris [1/9]
Modern text-to-image models are increasingly powered by large pretrained LLMs.
But there is a curious mismatch: the LLM typically encodes the prompt only once, while the evolving noisy latent states are handled entirely by a newly trained generative backbone.
Can a pretrained multimodal prior participate in the denoising process?
Introducing RepFusion. (1/12)
📄 https://t.co/WbkTtg5M79
🌐 https://t.co/iDHggosNJX
What is a good latent space for world modeling and planning? 🤔
Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning.
📑: https://t.co/CCmcEIJGM6
i’m joining forces with @ylecun and an incredible group of people to start AMI Labs @amilabs.
AMI isn’t a conventional lab. we don’t intend to become one.
a lot to say about why this moment matters, but for now we’re heads down building.
join us: https://t.co/zXj1IyBYDc
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision.
We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
[10/9] Solaris is our first foray into multi-agent world modeling. While we saw some interesting and surprising findings, it is more of a solid foundation for deeper experimentation. I’m particularly excited about the complex multi-agent behaviors and phenomena that emerge from learning on open & dynamic multiplayer environments!
[9/9] For more examples, source code, docs, and paper, check out links below!
🔗https://t.co/CndLctcqYc
📄https://t.co/eu5ANfWUfu
This project spanned everything from Java'ing Javascripts (sorry) to docking Dockers and figuring Figmas. I feel truly fortunate to have a group of talented collaborators with diverse collective expertise: @georgysavva, @ojmichel4, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, @SrivatsPoddar, @sainingxie
Working on Cambrian-S has been a genuinely meaningful learning experience.
❤️ I am grateful to all my amazing collaborators throughout this long journey, especially @jihanyang13, @_ellisbrown, @PinzhiHuang, Zihao Yang, Yue Yu,
@TongPetersb, @ZihanZheng71803, Yifan Xu, Muhan Wang, and @fred_lu_443 (also our amazing director‼️).
☺️Thanks to @sainingxie for continuously encouraging us to explore the unknown, pursue crazy ideas, and play the infinite game!
🥰And thanks to all supervisors @sainingxie, @drfeifei, @ylecun for guiding us through the maze.
🌕Mission never ends. Let’s keep building supersensing for superintelligence.
🧵[n/n]
@sainingxie told us to ONLY work on "crazy ideas."
Almost a year ago, we started Cambrian-S because "Supersensing" sounded super crazy.
This crazy idea kept me awake and caffeinated for months.
Today, all that work is live: Cambrian-S is here. So grateful to have built this alongside this incredible team.
Please take a look here.
Hope you find this idea crazy as well!
Website: https://t.co/r4nZiqcMGE
Github: https://t.co/dm882JkGzv
arXiv: https://t.co/Ya32Zhbvrf
Behind Cambrian-S are the passionate researchers who drive it. This video is a presentation, but more so a representation. I shot the short as an ode to the very humans behind it, and to these unique, surprising spaces and memories that are we. Please enjoy! May the experiment go on--
Introducing Cambrian-S
it’s a position, a dataset, a benchmark, and a model
but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
Introducing Representation Autoencoders (RAE)!
We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
In our new ICML paper, we show that popular families of OOD detection procedures, such as feature-based and logit-based methods, are fundamentally misspecified: they answer a different question than “is this point from a different distribution?”
https://t.co/gcks5PFyPX [1/7]