"It is by logic that we prove, but by [abstract] intuition that we discover." - Henri Poincaré.
When faced with a complex problem, we pause, we think.
Not exactly in words, not exactly in images —
in something more abstract,
something harder to name.
So, for truly intelligent agents, should we not ask that they do the same?
Introducing Mull-Tokens — a modality-agnostic latent thinking paradigm.
Now, the model can think in space, in time,
in words, in affordances —
in all the things that language alone
cannot easily convey.
https://t.co/4A4l7vED9d
Launching my research group, MAGIC (Manipulation and General Intelligence Control) Lab @NUSComputing, Singapore!
We focus on building the next generation of human-centric models for robotic manipulation — deployable safely, reliably, and easily in the real world. Our research spans MLLM reasoning, 3D vision, robot learning, simulation, dexterous manipulation, and cross-embodiment learning.
Interested in joining? Sign up here and I'll send a reminder email: https://t.co/9lEQnFERuh
Today we're releasing MolmoWeb, an open source agent that can navigate + complete tasks in a browser on your behalf.
Built on Molmo 2 in 4B & 8B sizes, it sets a new open-weight SOTA across four major web-agent benchmarks & even surpasses agents built on proprietary models. 🧵
This work would not be possible without all my amazing collaborators: Ahmed Abdelkader, Chengzhi Mao, Bryan Plummer, @kate_saenko_ , @RanjayKrishna , Leonidas Guibas, and Vincent Chu!
As conversations continue around grounding visual & textual reasoning, we believe latent, modality-agnostic thinking could be a promising direction.
The latents can be extended to anything - trajectories, 3D point-cloud features, audio!
Paper, code, and models posted. Dive in and let us know what you build! 🚀
🚀 Excited to share that my team at Meta just launched Segment Anything 3! SAM 3 doubles the performance of existing models on open-vocabulary instance segmentation on our new SA-Co benchmark, with 207K unique object labels. Huge congrats to the team, so proud of this work!
🌶️ hot take 🌶️
> we should normalize training on the test set
yes, you read that right.
no, I'm not joking.
and, yes... I have taken ML 101
👉 here's why this is crucial for future multimodal LLM research [1/n] 🧵
SIMS-V offers free, rich, and accurate (simulated) video annotations for object relationships, distances, and temporal tracking, capabilities often lacking in existing video training datasets. 🎞️💫
Mix it into your data and boost your model's performance on video reasoning tasks!
Code and data are open! https://t.co/gU2ODaoJhU
MLLMs are great at understanding videos, but struggle with spatial reasoning—like estimating distances or tracking objects across time.
the bottleneck? getting precise 3D spatial annotations on real videos is expensive and error-prone.
introducing SIMS-V 🤖
[1/n]