Excited to share T-Rex: Tactile-Reactive Dexterous Manipulation 🦖🤖
Touch is fundamental to human dexterity, yet most Vision-Language-Action (VLA) models either ignore tactile feedback or lack the ability to react to high-frequency contact signals.
In this work, we tackle both the data and architectural challenges of tactile-reactive dexterous manipulation.
🦖 A 100-hour tactile-synchronized dexterous manipulation dataset with 7,700+ trajectories, 22 motor primitives, and 200+ everyday objects.
🦖 A tactile-reactive MoT architecture with spatial-temporal tactile encoding and asynchronous high-frequency tactile refinement.
🦖 A scalable training recipe combining 22,889 hours of human egocentric pretraining with tactile-grounded robot mid-training.
Across 12 real-world contact-rich manipulation tasks, T-Rex achieves over 30% higher average success rate than the strongest baseline.
We are fully open-sourcing the dataset, models, teleoperation stack, training code, and inference pipeline.
🌐 Project: https://t.co/AiHKRR8YXU
📄 Paper: https://t.co/mXY2UNLlqc
💻 Code: https://t.co/7skCxUtwKC
🤗 Dataset: https://t.co/uNwW8dcRZL
🧵 Thread ↓
Fun fact: during the keynote, Apple cuts out a bit of 3k, 4k, 5k, and 6 kHz whenever they say "Siri", so that not everyone's HomePods start talking back 🗣️🚫
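For the curious, here is a rough sketch of what notching out those four bands could look like. It is not Apple's actual processing, which the post does not describe. The biquad notch form (from the widely used Audio EQ Cookbook), the 48 kHz sample rate, and `q=10.0` are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

def notch_coefficients(freq_hz, sample_rate_hz, q):
    """Biquad notch coefficients (Audio EQ Cookbook form), normalized so a0 = 1."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)  # (a1, a2)
    return b, a

def apply_biquad(samples, b, a):
    """Direct-form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def notch_bands(samples, sample_rate_hz, bands_hz=(3000, 4000, 5000, 6000), q=10.0):
    """Cascade one notch per band. Band list taken from the post; q is an assumption."""
    for freq_hz in bands_hz:
        b, a = notch_coefficients(freq_hz, sample_rate_hz, q)
        samples = apply_biquad(samples, b, a)
    return samples

def tone(freq_hz, sample_rate_hz, seconds):
    """Pure sine test tone with unit amplitude."""
    n = int(sample_rate_hz * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

SAMPLE_RATE_HZ = 48000  # assumed sample rate
inside = notch_bands(tone(4000, SAMPLE_RATE_HZ, 0.25), SAMPLE_RATE_HZ)
outside = notch_bands(tone(1000, SAMPLE_RATE_HZ, 0.25), SAMPLE_RATE_HZ)
# Compare levels on the second half of each signal, after the filter has settled.
inside_level = rms(inside[len(inside) // 2:])
outside_level = rms(outside[len(outside) // 2:])
```

A tone sitting on one of the notched bands is almost entirely removed once the filter settles, while a 1 kHz tone passes through nearly unchanged.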
Humans can see in high-res, high-FPS in real-time. Why can't VLMs?
Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos.
📄 https://t.co/GhbWZwMAg7
🌐 https://t.co/mEJ991MAIR
🤗 https://t.co/FOfc2QRThi
(1/n)🧵
📢 Deadline Extension for MMFM Workshop @ #CVPR2026!
We are extending the submission deadline to **March 14, 2026 (AoE)**. For updated details on submission timelines and guidelines, please refer to the workshop website and OpenReview page below. We’re excited to see your work!
The 5th edition of the MMFM Workshop is coming to @CVPR 2026!
"What is Next in Multimodal Foundation Models?" exploring the frontiers of vision, language, and beyond.
June 2026 | Denver, CO
Details in thread 👇
🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490)
🌐 With 17 environments across multiple domains, we systematically show how brittle VLMs are in visual interaction, and what training leads to.
🧵[1/8]
✨Thinking with Blender~
Meet VIGA: a multimodal agent that autonomously codes 3D/4D Blender scenes from any image, with no human input and no training!
@berkeley_ai #LLMs #Blender #Agent 🧵1/6
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨
1 point + a continuous granularity slider = the mask you want!
UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪
1/n
The Illustrated NeurIPS 2025: A Visual Map of the AI Frontier
New blog post!
NeurIPS 2025 papers are out—and it’s a lot to take in. This visualization lets you explore the entire research landscape interactively, with clusters, summaries, and @cohere LLM-generated explanations that make the field easier to grasp.
Link in thread!
Arxiv has been such a wonderful service but I think this is a step in the wrong direction.
We have other venues for peer review. To me the value of arxiv lies precisely in its lack of excessive moderation.
I'd prefer it as "github for science," rather than yet another journal.
Chinese doordash dropping MIT license foundation video models???
“We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across Text-to-Video, Image-to-Video, and Video-Continuation generation tasks.”
https://t.co/jPTY2Uac1S
Humans handle dynamic situations easily. What about models?
Turns out, they break in three distinct ways:
⛔ Force Stop → Reasoning leakage (won’t stop)
⚡️ Speedup → Panic (rushed answers)
❓ Info Updates → Self-doubt (reject updates)
👉Check out https://t.co/vr7f2ZYMTp
✨Introducing ECHO, the newest in-the-wild image generation benchmark!
You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them!
We distilled this qualitative discussion into a structured benchmark.
🔗 https://t.co/wJmmEY8TFQ
I'm at #ICCV2025 this week - send me a DM or email if you'd like to find a time to talk anything multimodal!
Speaking of multimodal, don't forget to check out our workshop: "What's Next in Multimodal Foundation Models?" on Monday in 326 B!
https://t.co/2BjLLfag9y
🌺 Join us in Hawaii at ICCV 2025 for the workshop
“What is Next in Multimodal Foundation Models?”
🗓️ Monday, October 20 | 8:00 – 12:00 📍 Room 326 B
We’ve got a stellar lineup of speakers & panelists. Details here: 🔗 https://t.co/t2DmcZAlWM
@ICCVConference
🚀Excited to share that our paper, “Do What? Teaching Vision-Language-Action Models to Reject the Impossible,” has been accepted to #EMNLP2025 Findings!
📄Paper: https://t.co/TDamXwV5i8
🌎Project page: https://t.co/QIK01vk60b
I'll be in Vienna for ACL starting today. I’m presenting work on how LMMs perform in-context updates in a Bayesian way, but I’m excited to talk about anything multimodal! Feel free to reach out if you’re around! #ACL2025
Some problems can’t be rushed—they can only be done step by step, no matter how many people or processors you throw at them.
We’ve scaled AI by making everything bigger and more parallel: Our models are parallel. Our scaling is parallel. Our GPUs are parallel.
But what if the real bottleneck isn’t size, but depth? What if the model just didn’t have enough serial steps to get it right? Some problems need depth, not width.
This is the Serial Scaling Hypothesis.
This is not the same as recent studies on scaling test-time compute, which focus on train-time vs. test-time compute and are agnostic to parallel vs. serial.
For example: test-time majority voting increases compute by running models in parallel — but doesn’t help when the task itself is serial.
We argue: what really matters is how the compute is structured. And for many real-world problems, it must be serial.
Read more at: https://t.co/msytYszWK0 or 🧵.
(In collaboration with: @layer07_yuxi, Kananart Kuwaranancharoen and @YutongBAI1002)
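A toy sketch of the majority-voting point above, not the paper's setup: the "task" is an iterated SHA-256 hash chain (widely believed to be hard to parallelize), and the "model" is a hypothetical function that can only take `DEPTH` serial steps per call. All names and constants here are assumptions made for illustration.

```python
import hashlib
from collections import Counter

def step(state: bytes) -> bytes:
    """One sequential update: each step needs the previous step's output."""
    return hashlib.sha256(state).digest()

def bounded_model(state: bytes, depth: int) -> bytes:
    """Toy 'model' that can apply at most `depth` serial steps per call."""
    for _ in range(depth):
        state = step(state)
    return state

def parallel_majority_vote(state: bytes, depth: int, copies: int) -> bytes:
    """Spend compute in parallel: run independent copies, keep the most common answer."""
    answers = [bounded_model(state, depth) for _ in range(copies)]
    return Counter(answers).most_common(1)[0][0]

def serial_chain(state: bytes, depth: int, calls: int) -> bytes:
    """Spend the same compute serially: feed each call's output into the next call."""
    for _ in range(calls):
        state = bounded_model(state, depth)
    return state

START = b"seed"
REQUIRED_STEPS = 32   # the task needs 32 serial steps
DEPTH = 4             # one call only reaches 4 steps
BUDGET = 8            # 8 calls either way, so 32 steps of total compute

target = bounded_model(START, REQUIRED_STEPS)
voted = parallel_majority_vote(START, DEPTH, BUDGET)
chained = serial_chain(START, DEPTH, BUDGET)

print("parallel voting solves it:", voted == target)
print("serial chaining solves it:", chained == target)
```

Both strategies use the same number of calls, but only the chained version reaches the required depth. Voting over more copies just repeats the same 4-step answer.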
Me (To Cursor): Refactor this code.
Cursor: Sure! I've refactored your code! It's shorter and cleaner now!
Me: Are you sure there are no feature regressions?
Cursor: The code is missing essential functionality.
Me: ....
📢 Call for Papers!
Last chance to hang with the CV crowd in Hawaii 🌴
We're hosting the 4th MMFM Workshop at #ICCV2025 — submit your work on vision, language, audio & more by July 1 🗓️
Also check out the CVPR edition 👉 @MMFMWorkshop
🔗 https://t.co/ZpLDbqIAOy