Introducing Gaze-LLE, a new model for gaze target estimation built on top of a frozen visual foundation model!
Gaze-LLE achieves SOTA results on multiple benchmarks while learning minimal parameters, and shows strong generalization
paper: https://t.co/Is2NgrrurO
We are grateful to all of the 17,491 reviewers who helped make #CVPR2026 possible. We are especially pleased to recognize the following Outstanding Reviewers, whose high-quality reviews (as judged by their Area Chairs) placed them among the top 5% of reviewers.
Introducing EgoVerse: an ecosystem for robot learning from egocentric human data.
Built and tested by 4 research labs + 3 industry partners, EgoVerse enables both science and scaling
1300+ hrs, 240 scenes, 2000+ tasks, and growing
Dataset design, findings, and ecosystem 🧵
If you’re interested in interactive physical AI, join our workshop at CVPR 2026, sponsored by Nvidia! We will host scientists to discuss socially-intelligent robots, interactive avatars, and on-device multimodal agents 🤖
Papers are due by Feb 28 🗓️
🌟NEW PAPER🌟
Do you know that changing a visual marker from red to blue can completely reorder VLM leaderboards? In our most recent work, we explore the fragility of visually prompted benchmarks. https://t.co/Kck6w7Vvf6
🤔 SAGE started out of curiosity as to why agentic post training had not made a big impact for developing video models in mid 2025.
💡With entertainment videos as our testbed, we find that one can turn great "direct" models into awesome "any-horizon agents" with effortful RL!
Some properties of LLMs only emerge with scale, one of which is the ability to effectively generalize from diverse data.
During my internship @physical_int, we uncovered an emergent property of VLAs: as we scale up pre-training, VLAs can naturally learn from human video data!
We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data!
This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video.
🎥 Introducing Split-then-Merge: A new video composition framework!
This approach enables the composition of any foreground video with any background video.
Unlike conventional methods that rely on annotated datasets or handcrafted rules, Split-then-Merge (StM) splits a large unlabeled corpus of videos into dynamic foreground and background layers, then merges them to learn how dynamic subjects interact with diverse scenes.
Work done in collaboration with team members at @Google: Du Tran (@dutran) , Yujia Chen (@IssacCyj) , Prof. Ming-Hsuan Yang (@MingHsuanYang), Vincent Chu: and my advisor at UIUC (@siebelschool): Prof. James M. Rehg (@RehgJim).
I will be attending NeurIPS, San Diego and would be happy to chat more!
🔗Project Webpage: https://t.co/D5UZ4BDi0N
📄Paper: https://t.co/L97S4QpU9m
the Artificial Social Intelligence workshop at #ICCV2025 will be continuing this afternoon! Join us in room 317B at 1:30pm for oral presentations from our paper track, keynotes from Leila Takayama and Michael Black, posters, and a breakout discussion!
https://t.co/Gsf5z18RRu
The #ICCV2025 Artificial Social Intelligence Workshop will be a full-day event on Sunday, 10/19 in Room 317B
Join us to discuss social reasoning, multimodality, and embodiment in socially-intelligent AI agents!
The #ICCV2025 Artificial Social Intelligence Workshop will be a full-day event on Sunday, 10/19 in Room 317B
Join us to discuss social reasoning, multimodality, and embodiment in socially-intelligent AI agents!
The #ICCV2025 Artificial Social Intelligence Workshop will be a full-day event on Sunday, 10/19 in Room 317B
Join us to discuss social reasoning, multimodality, and embodiment in socially-intelligent AI agents!
Great to see the Artificial Social Intelligence workshop happening at #ICCV2025 co-organized by my PhD student @fionakryan, Postdoc-alumni @sangminlee777 and our amazing collaborators.
Join us on Sunday 10/19 in Room 317B.
Can we scale up mobile manipulation with egocentric human data?
Meet EMMA: Egocentric Mobile MAnipulation
EMMA learns from human mobile manipulation + static robot data — no mobile teleop needed!
EMMA generalizes to new scenes and scales strongly with added human data.
1/9
Robots struggle to learn new skills from human videos.
Why? We found that naive co-training produces disjoint distributions.
Our EgoBridge (NeurIPS’25) extends Optimal Transport to align human-robot latents, improving success by 44% and generalization to human-only tasks!🧵
#ICCV2025 3D VQA from multi-view RGBs? Hold my Gaussian 🎨
We present a generalizable 3DGS-Language framework for 3D visual QA, trained in a self-supervised manner, no 3D-language labels needed during training.
Project: https://t.co/INoRlluyKb
August 1 is the deadline for non-archival papers for the #ICCV2025 artificial social intelligence workshop!
Papers in this track will be presented as posters at ICCV. We welcome any submissions of ongoing work, recently published work, or #ICCV papers in areas related to socially-intelligent AI!
OpenReview submission here:
https://t.co/LigoxcqT7y
There is 1 more week to submit non-archival extended abstracts to present at the Artificial Social Intelligence workshop @ICCVConference! We welcome work recently published in other venues (including the main ICCV conference) as well as works in progress!
Excited to announce the Artificial Social Intelligence Workshop @ ICCV 2025 @ICCVConference
Join us in October to discuss the science of social intelligence and algorithms to advance socially-intelligent AI! Discussion will focus on reasoning, multimodality, and embodiment.
Are AI scientists already better than human researchers?
We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.
Main finding: LLM ideas result in worse projects than human ideas.