Today’s AI models can describe what they see, but they cannot yet reason, simulate, and plan. SVI-Bench aims to push vision models toward human-level agency and autonomy.
Check out our benchmark and put your novel models to the test: https://t.co/GtuXoNnBXm
In the second before a play develops, a basketball player can instantly recognize the defensive scheme (perception), anticipate how the defense will rotate (causal reasoning), simulate several possible outcomes (simulation), and choose the best move (decision).
Today's video AI is far from this. These models can describe what they see, but they cannot explain why something happened, predict what comes next, or decide how to respond. We introduce SVI-Bench to measure these capabilities, and to push toward models that can reason over real-world, multi-agent video.
Join us today @CVPR rooms 704/706 #CVPR2026 at the EgoVis workshop - a long tradition of workshops on egocentric vision from #ECCV2016 10 years ago!
This edition features great keynotes, talks from relevant CVPR papers, accepted contributions & abstracts, and challenge & award results.
One of the most popular workshops at @CVPR! Join us for engaging talks and discussions on the latest challenges and breakthroughs in computer vision and multimodal AI. See you there!
The 5th Transformers for Vision and Multimodal AI workshop is happening at #CVPR2026 tomorrow (Wednesday, June 3rd)! We've got a great speaker lineup covering diverse topics across Transformers and Multimodal AI.
When: Wed, June 3rd
Where: Room 607
Website: https://t.co/SD892nEr8z
Schedule:
1:50 - 2:00 Opening Remarks
2:00 - 2:30 Ranjay Krishna
2:30 - 3:00 Jiatao Gu
3:00 - 3:30 Sherry Yang
3:30 - 4:00 Coffee Break
4:00 - 4:30 Juan Carlos Niebles
4:30 - 5:00 Zhuang Liu
5:00 - 5:30 Peter Tong
See you all tomorrow!
@thoma_gu @RanjayKrishna @sherryyangML @jcniebles @liuzhuang1234 @TongPetersb
I recently found this practical guide to building agents from OpenAI while doing some reading on agent evals. Nothing groundbreaking in terms of technical content, but it provides a really nice / rigorous structure around agent concepts and their tradeoffs that is useful.
What is an agent? It is possible to integrate an LLM into an automated workflow in a way that is not agentic; e.g., single-turn LLMs or chatbots. The core characteristic that makes a workflow agentic is whether the LLM is provided control of the workflow execution and allowed to make decisions. For example, an agent can control when the workflow is finished, attempt to recover from issues, and use tools to gather context or take actions.
Agent components. An agent is an LLM-powered system that includes multiple components in addition to the LLM:
- Tools: external functions or APIs for taking actions or gathering context.
- Instructions: written guidelines that describe in detail how the agent is expected to behave. We should draw upon existing documentation for our task.
Usually, we are using a reasoning model for the LLM, meaning that the model also has the ability to dynamically reason over the instructions to determine how to decompose a problem and call tools in order to accomplish a desired task.
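To make the agent definition concrete, here is a minimal sketch of the control loop it implies. Everything in it is an assumption for illustration: `stub_model` stands in for a real LLM call, and the `lookup_order` tool, the `INSTRUCTIONS` string, and the `run_agent` helper are hypothetical names, not part of the guide.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; the tool name and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
}

# Hypothetical instructions describing the expected behavior.
INSTRUCTIONS = "Look up the order, then report its status to the user."


def stub_model(instructions: str, history: List[str]) -> Tuple[str, str]:
    """Stand-in for an LLM call; returns (action, argument).

    A real agent would send the instructions and history to a model.
    The policy is hard-coded here so the example runs offline.
    """
    if not any(h.startswith("tool:") for h in history):
        return ("lookup_order", "A17")
    return ("finish", history[-1].removeprefix("tool:"))


def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"user:{task}"]
    for _ in range(max_steps):
        action, arg = stub_model(INSTRUCTIONS, history)
        if action == "finish":
            # The model, not the caller, decides when the workflow is done.
            return arg
        if action not in TOOLS:
            # Record the problem so the model can try to recover next step.
            history.append(f"error:unknown tool {action}")
            continue
        history.append(f"tool:{TOOLS[action](arg)}")
    return "stopped: step limit reached"


print(run_agent("Where is order A17?"))
```

The agentic part is the loop: the model picks each tool call and chooses when to finish, while the step limit is a simple guardrail kept outside the model's control.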
Beyond single agent. We can handle complex tasks with one agent by simply adding more tools to provide more capabilities. Multiple agents can be helpful to decompose complex workflows, but the extra complexity can also lead to downsides / lower performance.
We should only use multi-agent systems when necessary. Some signs that using multiple agents could be helpful are:
- Instructions are complex, contain many conditional cases, and are becoming difficult to scale / manage.
- Your single agent is experiencing tool overload, meaning that it struggles to select the correct tools from the large set of tools available due to the presence of many similar tools.
Multi-agent systems. There are two main ways we can create a multi-agent system:
1. Manager setup: we have a central “manager” agent that delegates sub-tasks to multiple specialized sub-agents via tool calls and stitches their results into a final answer.
2. Decentralized setup: we have multiple peer agents that hand tasks off to one another based upon their specific purposes.
Both of these structures are common in practice, and we should aim to make each agent in the multi-agent system flexible / composable to simplify scaling of the system over time. We don’t want a brittle system that breaks every time we need to tweak guidelines or add a new capability.
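The two setups can be contrasted with a small sketch. This is an illustration under stated assumptions, not the guide's implementation: the specialist agents are stubbed as plain functions, and `manager`, `triage`, and `handoff` are hypothetical names. In a real system the manager's plan and the triage decision would come from an LLM.

```python
from typing import Callable, Dict, List


# Hypothetical specialist agents, stubbed as plain functions.
def translate_agent(text: str) -> str:
    return f"[translated] {text}"


def summarize_agent(text: str) -> str:
    return f"[summary] {text[:20]}"


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "translate": translate_agent,
    "summarize": summarize_agent,
}


def manager(task: str, plan: List[str]) -> str:
    """Manager setup: call specialists like tools, then stitch the results."""
    results = [SPECIALISTS[name](task) for name in plan]
    return " | ".join(results)


def triage(task: str) -> str:
    """Decentralized setup: pick which peer should take over the task."""
    return "translate" if "translate" in task.lower() else "summarize"


def handoff(task: str) -> str:
    # Control moves to the chosen peer, which owns the final response.
    return SPECIALISTS[triage(task)](task)


print(manager("hello world", ["translate", "summarize"]))
print(handoff("Please translate this"))
```

The difference to notice is where control ends up: the manager keeps it and merges sub-results, while a handoff passes the whole task to a peer. Keeping specialists behind one registry is one way to add a capability without touching the others.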
Good instructions. One of the biggest keys to success is writing the best possible instructions for the agent(s). To write good instructions, we should:
- Draw upon existing documentation for the workflow.
- Clearly define guidelines and desired actions for the task.
- Prompt the agent to break the problem into steps.
- Provide concrete examples of how to handle edge cases.
An interesting paper from CVPR 2026: HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
https://t.co/8mQmcDeJTW
Fun fact: found my four papers from the last 4 years cited in a row like this. Nice sequence!
🎥🪄 What should happen when you remove an object from a video?
Example 1:
A domino chain is falling → remove the middle blocks → the last block should remain standing
Example 2:
Two cars are about to crash → remove one car → the other should drive away 🚙
Current video object removal models fail at these dynamic scenarios.
We introduce VOID: a model that removes objects and updates the scene as if they were never there.
🏆 Preferred 64.8% of the time vs Runway Aleph, Gen-Omnimatte, ProPainter, and more.
🌐 Project page: https://t.co/PBAWjuwUea
💻 GitHub: https://t.co/nYTv4miPSt
🤗 Demo: https://t.co/9DZpYCBUeN
📄 arXiv: https://t.co/UymkQC6Yku
w/ @willarvey @ZhuoningYuan @ChengTim0708 and collaborators at @NetflixResearch and @INSAITinstitute
Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026.
AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world.
Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video.
This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents.
🧵👇
If you're curious about the background that inspires a lot of our group's research on skill learning and video understanding, check out this great piece by UNC Research. It covers some of my journey from being a basketball player to an AI researcher.
https://t.co/rMqTJItpf1
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision.
We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
Thrilled to share that I’ve joined Netflix as a Machine Learning Scientist 🎬
I’ll work on large-scale AI solutions for content understanding and promotions, focusing on Computer Vision, Video Understanding, and VLMs—advancing video AI in real-world, production-scale systems.
DeepSeek just dropped a banger paper to wrap up 2025
"mHC: Manifold-Constrained Hyper-Connections"
Hyper-Connections turn the single residual “highway” in transformers into n parallel lanes, and each layer learns how to shuffle and share signal between lanes.
But if each layer can arbitrarily amplify or shrink lanes, the product of those shuffles across depth makes signals/gradients blow up or fade out.
So they force each shuffle to be mass-conserving: a doubly stochastic matrix (nonnegative, every row/column sums to 1). Each layer can only redistribute signal across lanes, not create or destroy it, so the deep skip-path stays stable while features still mix!
With n=4, it adds ~6.7% training time but cuts final loss by ~0.02 and keeps worst-case backward gain at ~1.6 (vs. ~3000 without the constraint), with consistent benchmark wins across the board.
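A small numeric sketch of why the doubly stochastic constraint keeps the deep skip path stable. It assumes Sinkhorn normalization (alternating row and column scaling) as one standard way to get an approximately doubly stochastic matrix; whether that matches the paper's exact procedure is an assumption here. The matrix `RAW`, the depth of 20, and reusing one matrix at every layer are illustrative simplifications.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn(m: Matrix, iters: int = 200) -> Matrix:
    """Alternately normalize rows and columns of a positive square matrix."""
    n = len(m)
    m = [row[:] for row in m]  # work on a copy
    for _ in range(iters):
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def chain(m: Matrix, depth: int) -> Matrix:
    """Product of `depth` copies of m, mimicking mixing across layers."""
    out = m
    for _ in range(depth - 1):
        out = matmul(out, m)
    return out


def max_sum_error(m: Matrix) -> float:
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    rows = [abs(sum(m[i]) - 1.0) for i in range(n)]
    cols = [abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)


# Illustrative unconstrained lane-mixing matrix (row sums above 1).
RAW: Matrix = [[1.5, 0.2, 0.1], [0.3, 1.4, 0.2], [0.1, 0.3, 1.6]]
MIX = sinkhorn(RAW)

print("unconstrained, largest entry after 20 layers:",
      max(max(r) for r in chain(RAW, 20)))
print("doubly stochastic, largest entry after 20 layers:",
      max(max(r) for r in chain(MIX, 20)))
```

Because a product of doubly stochastic matrices is itself doubly stochastic, the constrained chain's entries stay at or below 1 at any depth, while the unconstrained chain grows geometrically.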
Hard to believe it’s been almost 5 years since I started at UNC. 2025 was an exciting year for our group!
🎓 My two PhD students—who joined me when I had an empty group—are graduating. Watching them grow into experts has been the best part of the job.
🏀 We are branching into Robotics & Sports (combining my personal passions with work!).
🎥 Our new video systems, BIMBA & SiLVR, achieved excellent performance across many challenging benchmarks.
🏆 Grateful for the awards we received across academia and industry this year.
I used to worry about making it in academia. Now, I'm just happy to be here. Huge thanks to my group for an incredible 2025. Here is a snapshot of what we accomplished! 📸
I defended my PhD on October 6, 2025. Yesterday, my professor @gberta227 treated me to a wonderful lunch and surprised me with a thoughtful gift!
I’m truly grateful for the support, guidance, and kindness throughout these years. Feeling excited for what comes next 😊🙏