Our UMA is a unified model for object motion and robot action that learns from heterogeneous data sources using 3D object-motion trajectories as a shared interface.
Check it out: https://t.co/Nha6IiKW5O
Introducing the Unified Motion-Action (UMA) model, a robot foundation model that uses 3D object motion as a shared interface for heterogeneous robot learning. UMA treats motion and action as co-evolving variables, enabling knowledge transfer across data sources and versatile inference. 🧵 1/n
Modern text-to-image models are increasingly powered by large pretrained LLMs.
But there is a curious mismatch: the LLM typically encodes the prompt only once, while the evolving noisy latent states are handled entirely by a newly trained generative backbone.
Can a pretrained multimodal prior participate in the denoising process?
Introducing RepFusion. (1/12)
📄 https://t.co/WbkTtg5M79
🌐 https://t.co/iDHggosNJX
🧐A question I've long been interested in: how can we learn from human hands and transfer that directly to robots?
Our new work, HUG, makes it possible in three simple steps: (1) collect human grasps at scale, (2) learn from them, and (3) retarget for deployment.
Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space.
Now it is 0.75, and can be even lower.
Many wonder how.
I thought it might end as a small FID prank: simple and deliberate.
It started with one question: can FID be optimized directly, and what does it reveal?
Introducing FD-loss.
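For context, FID is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images. The sketch below is only the standard closed-form distance in NumPy, not the FD-loss objective itself; the helper names `frechet_distance` and `feature_stats` are hypothetical, and the features are assumed to be already extracted.

```python
import numpy as np

def feature_stats(feats):
    """Mean and covariance of an (N, D) array of features (assumed precomputed)."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative
    # when both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

Every term here is differentiable in the generated-feature statistics, which is what makes the question of optimizing it directly a natural one.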
Thanks to AK for sharing our paper!🎉
Training a generative critic model to judge responses makes it BETTER at EVERYTHING. Sometimes the best policy comes from good judgment. Your critic model has been hiding its true potential🌟
🚀Introducing LLaVA-Critic-R1, a family of VLMs that serve as both critic and policy in a single model.
No policy training. No in-domain task data.
Just 40k preference pairs ("Is response A or B better?") for critic RL training!
Result: +5.7% on 26 visual benchmarks, including visual understanding, reasoning, and even GUI agents. 71.9 on MMMU, SoTA at the 7B scale!
Learn to judge, excel at everything🎭
📄 Paper: https://t.co/KhDLvWpXVn
💻 Code: https://t.co/UGDWDvCLrk
Sharing our #CVPR2025 paper: "GPS as a Control Signal for Image Generation"! 🛰️+✍️ We turn the GPS tag stored in a photo's EXIF metadata into a control signal for diffusion models, so they know not just what you asked for, but where it should look like it was taken.
Come see our poster on Friday, 13 Jun, 10:30 a.m. to 12:30 p.m. (CT), in ExHall D, Poster #250.
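As a small illustration of what a GPS tag in EXIF looks like: coordinates are stored as degrees, minutes and seconds plus a hemisphere reference (N/S or E/W). The sketch below converts them to signed decimal degrees, the form one would typically feed to a model as conditioning. The function name `dms_to_decimal` is hypothetical, and this is not taken from the paper's pipeline.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds to signed decimal degrees.

    `ref` is the hemisphere reference: "N" or "S" for latitude,
    "E" or "W" for longitude. South and West become negative.
    """
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if ref in ("S", "W") else value

# Example: a (latitude, longitude) pair as a conditioning vector.
location = (dms_to_decimal(42, 16, 48.0, "N"), dms_to_decimal(83, 44, 24.0, "W"))
```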
Excited to share our CVPR 2025 paper on cross-modal space-time correspondence!
We present a method to match pixels across different modalities (RGB-Depth, RGB-Thermal, Photo-Sketch, and cross-style images) — trained entirely using unpaired data and self-supervision.
Our approach learns correspondences through contrastive random walks across visual modalities.
#CVPR2025 (1/6)
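For readers new to contrastive random walks, here is a minimal NumPy sketch of the general idea: turn feature similarities into transition probabilities, walk from modality A to modality B and back, and penalize walks that do not return to their starting node. It is a generic cycle-consistency illustration, not the paper's implementation; `cycle_walk_loss`, the temperature value and the feature shapes are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cycle_walk_loss(feats_a, feats_b, temperature=0.07):
    """Round-trip walk loss for (N, D) node features from A and (M, D) from B."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature        # (N, M) scaled cosine similarities
    p_ab = softmax(sim, axis=1)        # transition probabilities A -> B
    p_ba = softmax(sim.T, axis=1)      # transition probabilities B -> A
    round_trip = p_ab @ p_ba           # (N, N) probabilities for A -> B -> A
    # Cross-entropy against the identity: each walker should return home.
    return float(-np.log(np.diag(round_trip) + 1e-12).mean())
```

The target is just "return to where you started", so no paired labels between the two modalities are needed.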
Can AI image detectors keep up with new fakes?
Mostly, no. Existing detectors are trained using a handful of models. But there are thousands in the wild!
Our work, Community Forensics, uses 4800+ generators to train detectors that generalize to new fakes.
#CVPR2025 🧵 (1/5)
Ever wondered how a scene sounds👂 when you interact👋 with it?
Introducing our #CVPR2025 work "Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes" -- we make 3D scene reconstructions audibly interactive!
https://t.co/tIcFGJtB7R