We just released WatchAct, a benchmark for behavior-grounded robot manipulation (covered in the talk below). The robot has to watch a human video, infer what was done, make a plan, and then execute it. The best VLM (Gemini-3.1-Pro) only reaches 36.8% planning success rate, so still plenty of room for improvement.
Paper: https://t.co/z4COXQIMhh
Project: https://t.co/6CBUWQlEk8
Code: https://t.co/Kl8BO7vqiF
Dataset: https://t.co/IW0NKFUcZj
w/ @baiqil0203@cezhhh@yuffishh@dingmyu
In the second before a play develops, a basketball player can instantly recognize the defensive scheme (perception), anticipate how the defense will rotate (causal reasoning), simulate several possible outcomes (simulation), and choose the best move (decision).
Today's video AI is far from this. These models can describe what they see, but they cannot explain why something happened, predict what comes next, or decide how to respond. We introduce SVI-Bench to measure these capabilities, and to push toward models that can reason over real-world, multi-agent video.
What comes after today’s visual backbones?
At T4V @CVPR 2026, we’re bringing the community together for a focused half-day workshop on Transformers for Vision and Multimodal AI — covering image, video, 3D, MLLMs, efficient attention, SSMs/Mamba, and the next generation of visual architectures.
📍 Wed June 3 · Room 607
🕐 1:45–5:40 pm (Denver local time)
Invited speakers:
@RanjayKrishna, @thoma_gu, @sherryyangML, @jcniebles, @liuzhuang1234, and @TongPetersb.
Join us at CVPR:
https://t.co/uVTBoS2vqJ
@UNC@NVIDIAAI@NVIDIAAIDev@AIatMeta@ImagineEnpc@NJU1902@BaskinEng
#CVPR2026 #T4V #MultimodalAI #NVIDIA
🚨 Excited to share EgoMemReason, a benchmark for multi-level memory-driven reasoning (entity, event, and behavior memory) over week-long egocentric videos (average 25.9 hours of temporal backtracking)!
📉 Current long video approaches can retrieve isolated event, but struggle with long-horizon memory that requires retrieve and understand across multiple events and long time: tracking evolving entities across days, linking temporally distant events, and abstracting recurring behavior patterns from long observations.
🎥 EgoMemReason evaluates these challenges through 500 human-verified questions spanning entity, event, and behavior memory, requiring aggregation over an average of 5.1 evidence segments and 25.9 hours of temporal backtracking.
⭐️ Across 17 models/frameworks, even the best model achieves only 39.6% accuracy, revealing that long-horizon multimodal memory remains far from solved.
Thrilled to receive the NSF CAREER award! 🥳 Huge thanks to my students, collaborators, and mentors who made this possible. Grateful to NSF for supporting our work.
Hard to believe it’s been almost 5 years since I started at UNC. 2025 was an exciting year for our group!
🎓 My two PhD students—who joined me when I had an empty group—are graduating. Watching them grow into experts has been the best part of the job.
🏀 We are branching into Robotics & Sports (combining my personal passions with work!).
🎥 Our new video systems, BIMBA & SiLVR, achieved excellent performance across many challenging benchmarks.
🏆 Grateful for the awards we received across academia and industry this year.
I used to worry about making it in academia. Now, I'm just happy to be here. Huge thanks to my group for an incredible 2025. Here is a snapshot of what we accomplished! 📸
Is language a "terrible abstraction" for video understanding?
Many in the video community often dismiss language-driven approaches in favor of complex, video-native solutions. However, I believe this resistance stems more from internal bias—validating a research identity as a "vision/video researcher"—than from empirical reality.
Simple, language-driven systems often dominate complex, video-native solutions. Over the last two years, our group at UNC has developed a series of such language-driven frameworks (LLoVi, VideoTree, VidAssist, and SiLVR) based on a simple, modular pipeline:
Video Input → Dense Captioning → LLM Reasoning
Empirically, these simple systems frequently outperform sophisticated video-focused solutions across numerous benchmarks while offering significant advantages:
Scaling: By decoupling vision and reasoning, we can leverage the most powerful LLMs (even 1T+ parameters). Video-native approaches hit GPU memory walls, often limited to <256 frames, making long-form video analysis very difficult.
Adaptability & Flexibility: Better captioners or LLMs (released very frequently these days) instantly improve the system with minimal effort.
Training-Free: They work out-of-the-box and easily incorporate diverse data (bboxes, audio/speech, etc) as textual captions. This avoids the complex, resource-intensive training regimes used by many recent approaches (e.g., particularly RL-based video reasoning systems).
While end-to-end systems may eventually prevail, these language-driven frameworks currently offer the most performant and practical approach to video reasoning. They shouldn’t be dismissed just because they are driven by language; they should be used as strong baselines to advance the field.
SiLVR: https://t.co/iDJLJ4dq3z
LLoVi: https://t.co/GAjRplf6cl
VideoTree: https://t.co/wqFm5oh09r
VidAssist: https://t.co/6VoBPMvgx8
Recent advances in test-time optimization have led to remarkable reasoning capabilities in LLMs. However, the reasoning capabilities of MLLMs still significantly lag, especially for complex video-language tasks.
We present SiLVR, a Simple Language-based Video Reasoning framework.
Our framework offers several benefits.
1) Simplicity. No complex RL-based optimization or specialized modules for different tasks.
2) Generalizability. Can be applied to a wide range of complex video-language tasks.
3) Modularity. Enables seamless use of visual captioning models and LLMs.