So excited to present CHAI, a long-term project that tackles video understanding in the right way 🎥
If your research or interests even loosely connect to video or multimodal reasoning, please check it out. Huge thanks to my co-authors and advisors!
Before AI can generate professional videos, it needs to see like a professional.
We spent a year with 100+ content creators teaching AI to describe video like a filmmaker would.
Introducing CHAI: Critique-based Human-AI Oversight for Building a Precise Video Language [CVPR'26 Highlight, Top 3%].
Try prompting a video generator for a dolly zoom, dutch angle, point of view, or camera roll. Most fall back to the same bland defaults: a push-in, a level shot, a third-person view. Why? These techniques require a language of cinema that current models rarely speak.
We built that language:
1️⃣ Precise specification: 5-aspect structured captions co-designed with professional cinematographers covering subject, scene, motion, spatial, and camera dynamics
2️⃣ Scalable oversight: LLMs draft captions, humans critique what's wrong and how to fix it
3️⃣ Post-training recipes: Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5
4️⃣ Video generation: fine-tuned Wan follows 400-word cinematic prompts with precise control
Here's how each works 🧵
Work led by CMU and Harvard with @chancharikm, @du_yilun, and @RamananDeva.
📄 Paper: https://t.co/wCwEtvrntM
🌐 Site: https://t.co/oAAQklGrfF
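As a rough illustration of points 1️⃣ and 2️⃣ above, here is a minimal sketch of what a 5-aspect caption and a human critique could look like as data. The five field names come from the post; the `Critique` record, `apply_critique`, and `to_prompt` are hypothetical helpers for illustration, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class StructuredCaption:
    """One caption split into the five aspects named in the post."""
    subject: str
    scene: str
    motion: str
    spatial: str
    camera: str

    def to_prompt(self) -> str:
        # Render one "aspect: text" line per field, in declaration order.
        return "\n".join(f"{f.name}: {getattr(self, f.name)}" for f in fields(self))


@dataclass(frozen=True)
class Critique:
    """Hypothetical critique record: what is wrong and how to fix it."""
    aspect: str   # which of the five aspects the critique targets
    problem: str  # what the reviewer says is wrong with the draft
    fix: str      # corrected text for that aspect


def apply_critique(caption: StructuredCaption, critique: Critique) -> StructuredCaption:
    """Return a revised caption with the critiqued aspect replaced."""
    valid = {f.name for f in fields(caption)}
    if critique.aspect not in valid:
        raise ValueError(f"unknown aspect: {critique.aspect!r}")
    return replace(caption, **{critique.aspect: critique.fix})


# Example: a model-drafted caption corrected by a human critique.
draft = StructuredCaption(
    subject="a cyclist in a red jacket",
    scene="a rainy city street at night",
    motion="the cyclist rides toward the camera",
    spatial="cyclist centered, cars parked on both sides",
    camera="slow push-in",
)
note = Critique(
    aspect="camera",
    problem="the background stretches while the subject stays the same size",
    fix="dolly zoom",
)
revised = apply_critique(draft, note)
print(revised.to_prompt())
```

Keeping each critique tied to a single aspect means a reviewer only has to correct the part of the draft that is wrong, and the rest of the caption is left untouched.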
This work would not have been possible without an incredible roster of co-authors and collaborators.
First and foremost, a massive thank you to the leads who spearheaded this research and turned a massive two-year vision into reality: 🔸Yuvan Sharma, 🔸Dantong Niu @Dantong_Niu, and 🔸Anirudh Pai @apai253.
Building robust robotics takes a village. A huge shoutout to the dedicated team who spent countless hours collecting data and pushing through rigorous evaluations. Flawless execution from: 🔸Zekai Wang, 🔸Zhuoyang Liu @liu_zhuoyang13, 🔸Baifeng Shi @baifeng_shi, 🔸Stefano Saravalle, 🔸Boning Shao, 🔸Ruijie Zheng, 🔸Jing Wang, 🔸Konstantinos Kallidromitis, 🔸Yusuke Kato, and 🔸Fabio Galasso!
Finally, special thanks for the invaluable mentorship, support, and help shaping the narrative: 🔸Yuke Zhu @yukez, 🔸Danfei Xu @danfei_xu, 🔸Linxi Fan @DrJimFan, 🔸Trevor Darrell @trevordarrell, and 🔸Jitendra Malik @JitendraMalikCV.
A truly monumental team effort from BAIR @berkeley_ai and NVIDIA @nvidia! 🚀
Incredibly grateful to have been a part of this project with @ZhiqiuLin! VQAScore is one of the most generalizable and lightweight frameworks for measuring all types of text-image or text-video alignment. Highly recommend taking a look now that the updated models are in play!
Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score.
It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. Adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs. Our open-source model has 2M+ downloads on HuggingFace.
Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, capturing generation accuracy for prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift."
As VLMs get stronger, VQAScore gets stronger. For free.
📄 Paper: https://t.co/8CTwcPckL1
💻 Code: https://t.co/6ea4uwD5C1
Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.
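The P(Yes) idea described above can be sketched in a few lines. This is a toy illustration, not the released VQAScore code: it assumes you already have the VLM's logits for the first answer token, and the names `build_question` and `p_yes` are hypothetical. The question wording follows the template quoted in the post.

```python
import math


def build_question(prompt: str) -> str:
    """Fill the question template from the post with a generation prompt."""
    return f'Does this image show "{prompt}"?'


def p_yes(answer_logits: dict[str, float]) -> float:
    """Softmax over the candidate answer tokens, returning the mass on "Yes".

    `answer_logits` maps answer tokens to raw logits. In practice these
    would come from a VLM conditioned on the image (or video) and the
    question; here they are supplied by hand (an assumption).
    """
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


# Made-up logits for two generated candidates scored against the same prompt.
question = build_question("a dog entering the background")
candidates = {
    "video_a": {"Yes": 2.0, "No": -1.0},
    "video_b": {"Yes": 0.0, "No": 0.5},
}
scores = {name: p_yes(logits) for name, logits in candidates.items()}
best = max(scores, key=scores.get)
print(question, scores, best)
```

Because the score is just an answer probability from whichever VLM you plug in, swapping in a stronger model changes the logits and leaves the scoring code unchanged.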
At a high level, CHAI is about building a compositional language for video: not just modeling subject motion, but also subtle, high-impact factors like camera motion and shot composition that are often ignored.
Explicitly modeling these leads to more faithful understanding and better control for generation.
Excited about implications for video understanding, text-to-video, robotics, and 4D reasoning!
Full details (and great examples) in the main thread + paper reposted above!
[1/7] Video diffusion has come a long way, generating more & more realistic videos.
Can we revisit sparse-view novel view synthesis through these video priors?
Meet FrameCrafter: a permutation-invariant multi-view model built on video diffusion 🧵
🌐 https://t.co/ogEN4mkE92
They say a picture is worth a thousand words, which is precisely why the usefulness of a visual representation depends on its semantic context.
Check out this exciting work on steering visual representations with language!
Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat").
In human vision, priming with language changes how people parse an image. We believe visual encoders should do the same.
🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.
Most of today's AI can see the world, but it doesn’t **feel** it.
Capturing the sense of touch is crucial for dexterous robotic manipulation, user modeling, and understanding physical interactions.
Introducing OpenTouch: bringing full-hand tactile sensing into real-world AI🖐️
OpenTouch is collected in-the-wild using tactile sensing gloves, hand pose tracking gloves, and egocentric glasses. It includes:
• 5 hours of real-world data,
• 3 hours of densely annotated, contact-rich interactions,
• 2,900 curated interaction clips,
• across 800 objects, 14 environments, and 29 grasp types.
All open at: https://t.co/nONVdxLwXJ
Robotic reward modeling needs to consider failure cases and represent partial progress. This work takes an interesting approach to dealing with these challenges, among other valuable contributions. Definitely worth a read!
A reward model that works, zero-shot, across robots, tasks, and scenes?
Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories.
Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more!
🧵 (1/12)
Current Vision-Language Models struggle with complex 4D dynamics. We fixed that. 🤯
🚨 Introducing 4D-RGPT: distilling perceptual knowledge directly into LLMs for precise space & time reasoning.
🎉 Excited to share that our @NVIDIAAI work has been accepted to #CVPR2026! @CVPR
A quick dive into how it works 🧵👇
@kellerjordan0 ICL struggles with current instruction-tuned models. But finetuning sparse, task-specific model representations with few-shot examples might be promising! Maybe it's more about representations than learning algorithms! Our work might be worth checking out:
https://t.co/04nk4l0VmY
That's a great question. Actually, the method is more of a finetuning approach than an architecture. It is heavily inspired by ideas from neuroscience. If we view the VLA as a brain that has learned useful things in pretraining, but we need to apply it to some specific downstream task, we wouldn't update the entire model, right (i.e., what LoRA and full FT do)?
Instead, functional specificity in cog sci/neuroscience tells us that certain parts of the brain are useful for certain tasks. So our approach finetunes only the task relevant parts of the model while freezing everything else. Excitingly, this is not only more efficient, but also more performant and generalizable than standard finetuning!
Feel free to check out the paper! Though it is technical, the abstract, intro, and motivation should be broadly accessible to a wider audience.
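To make "finetune only the task-relevant parts while freezing everything else" concrete, here is a toy sketch. It is not the paper's method: each "head" is reduced to a single scalar parameter, the relevance scores are assumed to be given (how they are computed is not described in this thread), and `select_heads` and `masked_sgd_step` are hypothetical names.

```python
def select_heads(relevance: list[float], k: int) -> set[int]:
    """Pick the indices of the k heads with the highest relevance scores."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return set(ranked[:k])


def masked_sgd_step(
    params: list[float],
    grads: list[float],
    trainable: set[int],
    lr: float = 0.1,
) -> list[float]:
    """One SGD step that updates only the selected heads; the rest stay frozen."""
    return [
        p - lr * g if i in trainable else p
        for i, (p, g) in enumerate(zip(params, grads))
    ]


# Toy example: four "heads", two of which are judged relevant to the task.
relevance = [0.1, 0.9, 0.3, 0.8]   # assumed per-head task relevance
params = [1.0, 1.0, 1.0, 1.0]      # one scalar stands in for each head's weights
grads = [1.0, 1.0, 1.0, 1.0]       # gradients from a few task demos (made up)

trainable = select_heads(relevance, k=2)
updated = masked_sgd_step(params, grads, trainable)
print(trainable, updated)
```

The frozen entries never move, which is why far fewer parameters are touched than in full finetuning.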
@batsuev_es That's a neat idea. No, we haven't done that. I think what you suggest is "sparsity" or "ranking" in a different sense from our approach, but it is definitely worth considering.
🎉 Despite massive pretraining, VLAs need to adapt to specific physical contexts. We introduce Robotic Steering, a novel finetuning method using mechanistic interpretability to surpass standard FT:
🎁 22× fewer parameters
🎁 +53% on unseen tasks
🎁 Interpretable
Thread below👇
Introducing CRISP, a real-to-sim pipeline that recovers human motion and simulatable scene geometry from monocular video!
CRISP builds a contact-faithful 3D scene for simulation: 8× fewer sim failures, 43% faster simulation, and improved human motion!
Interactive demos👉: https://t.co/locrdrxO16
Exciting collaboration w/ @JiashunWang, @jefftan969, @_Tsukasane, Jessica Hodgins, @shubhtuls, and @RamananDeva
I appreciate this perspective! In some sense, I agree the term "generalizable" intelligence is overloaded. As such, to be precise: what I am interested in with this work is actually enabling efficient, few-shot adaptation given just a few demos. This is the particular, specific sense of "generalization" we care about here.
Now, I don't think there is any pivot necessarily. This is both a sample-efficient and parameter-efficient approach to finetuning that preserves more general pretrained capabilities compared to standard finetuning. In this way, Robotic Steering is not a tradeoff between these things, but rather a demonstration that a mechanistic approach to finetuning can have real benefits over LoRA and SFT. There's definitely more work to be done in this space, and I for one am excited to see the progress in the field!
🔍 Bonus: It's interpretable!
Different tasks activate different attention heads. You can visualize exactly which circuits drive each behavior.
Mechanistic understanding meets robotics 🤝