🚨 Your Embedding Model is SMARTer Than You Think! Single-vector models actually hide powerful multi-vector capabilities in their frozen hidden states. We introduce SMART, a framework that unlocks this ability for SoTA multimodal retrieval. 🧵👇 🔗 https://t.co/UBpQ2y4sXU
🔥Excited to share the first released work from our IEI lab! Congrats to @AnteaWu 🎉
This work is motivated by the lack of quantitative evaluation for physics alignment in video world models. With tools like MegaSam and CoTracker, we can directly reconstruct dynamic 3D scenes, enabling quantitative evaluation of physical alignment.
Both code and data are released — feel free to try it out! It should work, but if it doesn’t, contact @AnteaWu directly : )
We are offering grants of $100,000 + Tinker credits to researchers advancing the field of human-AI interactivity. Submit your proposals by June 19th!
https://t.co/907HfBy7g3
My first share since joining @thinkymachines. Fun working with this team on real-time multimodal interaction. Vision in turn-based models felt like flipping through photos — continuous video is a different problem.
Visual proactivity is essential. Grateful to have worked on this alongside @liliyu_lili, @rown, and the rest of the team!
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
https://t.co/AFJZ5kH7Ku
Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable open models in the world!
Gemma 4 is built to run on your hardware: phones, laptops, and desktops.
Frontier intelligence with a 26B MoE and a 31B Dense model!
🤯 Upgrade your pretrained visual encoder with <10 lines of code.
This is what vision researchers have ignored:
Can you imagine that multiscale in pixel space works this well?! Remember, we are not doing multiscale in feature space!
🏠Project Page: https://t.co/LLXO2Z39lt
📷 Paper: https://t.co/HP058lSQn6
Get uniform improvements on MLLMs, Seg, and Depth at a similar computation cost.
🔥 Upgrade your frozen vision encoders with <10 lines of code!
Single-scale inference throws away vital details. Enter MuRF 🚀: a simple, training-free plug-in for instant, massive gains in MLLMs, Seg & Depth. 🤯 1/6
Good question, we have efficiency analysis in the paper!
And it is straightforward:
For MLLMs: by design, MuRF keeps the same number of visual tokens as single-scale inference, so the computation cost of the LLM part is unchanged. Empirically, we observed that MuRF has similar VRAM usage, training time, and inference time to the single-resolution baseline. This works because the visual encoder is much smaller than the LLM!
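A rough way to see the token-budget argument is to count patches. The sketch below assumes that features from every scale are resampled onto the single-scale grid and fused per location, which the reply does not spell out. The resolution, patch size, and scales are hypothetical numbers chosen for illustration, not values from the paper.

```python
def token_counts(base_res, patch, scales):
    """Count encoder patches per scale and visual tokens passed to the LLM.

    Assumption: all scales are fused onto the scale-1.0 grid, so the LLM
    token count does not depend on how many scales are used.
    """
    per_scale = [(round(base_res * s) // patch) ** 2 for s in scales]
    llm_tokens = (base_res // patch) ** 2
    return per_scale, llm_tokens


# Hypothetical setup: 224-pixel input, 16-pixel patches, three scales.
per_scale, llm_tokens = token_counts(base_res=224, patch=16, scales=(0.5, 1.0, 2.0))
print(per_scale)   # patches the vision encoder processes at each scale
print(llm_tokens)  # tokens the LLM sees, same as single-scale inference
```

Under these assumptions only the vision encoder does extra work, which is consistent with the point above that the encoder is small relative to the LLM.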
Hi Thomas, thanks for the comment! Huge fan of S² and learned upsamplers like AnyUp! 🤝 While we share the goal of multi-scale representation, MuRF takes a fundamentally different path.
TL;DR: We show that simply resizing the whole image (no tiling!) and fusing features creates a universally stronger representation without any learned upsampling heuristics.
Here is the deeper dive into why we are different:
1️⃣ Motivation & Token Budget: We asked: Does higher resolution always mean better features? Surprisingly, no! Low-res provides crucial global context that actually improves high-res performance. For MLLMs, we lift the performance ceiling by a large margin while keeping the exact same number of visual tokens!
2️⃣ Approach (No Tiling, No Bells & Whistles): Unlike S², which cuts images into independent patches (breaking spatial layout and object continuity), we process the entire image at different scales. No complex layout engineering. As for AnyUp, learned upsamplers are great, but our parameter-free bilinear upsampling requires zero training. This guarantees extreme simplicity, maximum flexibility, and prevents generalizability issues.
3️⃣ Universal Application: We aren't just optimizing MLLM token budgets. MuRF is a fundamental, training-free enhancement for visual representations—generalizing flawlessly out-of-the-box across high-level reasoning (MLLMs), dense geometry (Seg/Depth), and even unsupervised anomaly detection.
We believe this simple, holistic multi-scale synergy is a highly promising direction. Let's push toward better visual representations together! 🚀
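The recipe described in this reply (resize the whole image, encode each scale with a frozen encoder, bilinearly upsample, fuse) can be sketched in plain Python. This is a toy illustration, not the released MuRF code: `toy_encoder` is a hypothetical mean-pooling stand-in for a real frozen vision encoder, and fusing by channel-wise concatenation on the scale-1.0 grid is an assumption, since the thread only says the features are fused.

```python
# Images and feature maps are nested lists: fmap[y][x] is a list of channels.

def bilinear_resize(fmap, out_h, out_w):
    """Parameter-free bilinear resize of an H x W x C nested-list map."""
    in_h, in_w = len(fmap), len(fmap[0])
    channels = len(fmap[0][0])
    out = []
    for oy in range(out_h):
        # Map the output pixel centre back to input coordinates, then clamp.
        sy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            row.append([
                (1 - wy) * ((1 - wx) * fmap[y0][x0][c] + wx * fmap[y0][x1][c])
                + wy * ((1 - wx) * fmap[y1][x0][c] + wx * fmap[y1][x1][c])
                for c in range(channels)
            ])
        out.append(row)
    return out


def toy_encoder(image, patch=2):
    """Hypothetical frozen-encoder stand-in: mean-pool each patch x patch block."""
    h, w = len(image) // patch, len(image[0]) // patch
    channels = len(image[0][0])
    return [
        [
            [
                sum(image[y * patch + dy][x * patch + dx][c]
                    for dy in range(patch) for dx in range(patch)) / patch ** 2
                for c in range(channels)
            ]
            for x in range(w)
        ]
        for y in range(h)
    ]


def fuse_multiscale(image, scales=(0.5, 1.0), patch=2, encoder=toy_encoder):
    """Encode the whole image at each scale and fuse on one shared token grid."""
    base_h, base_w = len(image), len(image[0])
    out_h, out_w = base_h // patch, base_w // patch  # single-scale token grid
    fused = [[[] for _ in range(out_w)] for _ in range(out_h)]
    for s in scales:
        # Resize the entire image (no tiling), keeping at least one patch.
        resized = bilinear_resize(
            image, max(patch, round(base_h * s)), max(patch, round(base_w * s))
        )
        feats = bilinear_resize(encoder(resized, patch), out_h, out_w)
        for y in range(out_h):
            for x in range(out_w):
                fused[y][x].extend(feats[y][x])  # assumed fusion: concat channels
    return fused


# Usage: an 8 x 8 single-channel image fused over three scales.
image = [[[float(x + y)] for x in range(8)] for y in range(8)]
tokens = fuse_multiscale(image, scales=(0.5, 1.0, 2.0))
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # grid height, width, channels
```

With concatenation the token grid stays the size of the single-scale grid while the channel dimension grows with the number of scales, which matches the claim that the visual token count is unchanged.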