Chancharik Mitra

Verified account

@chancharikm

MS in Machine Learning at CMU | Researching Machine Learning Reasoning in Multiple Modalities

Pittsburgh, PA

Joined June 2024

133 Following

189 Followers

91 Posts

Pinned Tweet

Chancharik Mitra

about 2 months ago

So excited to present CHAI, a long-term project that tackles video understanding in the right way 🎥 If your research or interests even loosely connect to video or multimodal reasoning, please check it out. Huge thanks to my co-authors and advisors!

about 2 months ago

Before AI can generate professional videos, it needs to see like a professional. We spent a year with 100+ content creators teaching AI to describe video like a filmmaker would. Introducing CHAI: Critique-based Human-AI Oversight for Building a Precise Video Language [CVPR'26 Highlight, Top 3%].
Try prompting a video generator for a dolly zoom, dutch angle, point of view, or camera roll. Most fall back to the same bland defaults: a push-in, a level shot, a third-person view. Why? These techniques require a language of cinema that current models rarely speak. We built that language:
1️⃣ Precise specification: 5-aspect structured captions co-designed with professional cinematographers covering subject, scene, motion, spatial, and camera dynamics
2️⃣ Scalable oversight: LLMs draft captions, humans critique what's wrong and how to fix it
3️⃣ Post-training recipes: Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5
4️⃣ Video generation: fine-tuned Wan follows 400-word cinematic prompts with precise control
Here's how each works 🧵
Work led by CMU and Harvard with @chancharikm, @du_yilun, and @RamananDeva.
📄 Paper: https://t.co/wCwEtvrntM
🌐 Site: https://t.co/oAAQklGrfF

25

372

63

494

35K

1

9

3

1

1K

chancharikm retweeted

5 days ago

This work would not have been possible without an incredible roster of co-authors and collaborators. First and foremost, a massive thank you to the leads who spearheaded this research and turned a massive two-year vision into reality: 🔸Yuvan Sharma, 🔸Dantong Niu @Dantong_Niu, and 🔸Anirudh Pai @apai253. Building robust robotics takes a village. A huge shoutout to the dedicated team who spent countless hours collecting data and pushing through rigorous evaluations. Flawless execution from: 🔸Zekai Wang, 🔸Zhuoyang Liu @liu_zhuoyang13, 🔸Baifeng Shi @baifeng_shi, 🔸Stefano Saravalle, 🔸Boning Shao, 🔸Ruijie Zheng, 🔸Jing Wang, 🔸Konstantinos Kallidromitis, 🔸Yusuke Kato, and 🔸Fabio Galasso! Finally, special thanks for the invaluable mentorship, support, and help shaping the narrative: 🔸Yuke Zhu @yukez, 🔸Danfei Xu @danfei_xu, 🔸Linxi Fan @DrJimFan, 🔸Trevor Darrell @trevordarrell, and 🔸Jitendra Malik @JitendraMalikCV. A truly monumental team effort from BAIR @berkeley_ai and NVIDIA @nvidia! 🚀

0

0

1

0

596

Chancharik Mitra

14 days ago

Incredibly grateful to have been a part of this project with @ZhiqiuLin! VQAScore is one of the most generalizable and lightweight frameworks for measuring all types of text-image or text-video alignment. Highly recommend everyone take a look with the new updated models in play!

14 days ago

Two years ago we released VQAScore with a simple idea: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It has since become the go-to evaluation metric and reward model for visual generation, replacing CLIPScore across the field. Adopted by Google DeepMind (Imagen 3 & 4), NVIDIA, ByteDance, and other frontier labs. Our open-source model has 2M+ downloads on HuggingFace. Today: a major upgrade. VQAScore now supports text-to-video evaluation using 20+ SOTA VLMs including GPT, Gemini, and Qwen, capturing generation accuracy for prompts like "a shallow depth of field shot rack focusing from a foreground crumpet to a dog entering the background, catching it at the focus shift." As VLMs get stronger, VQAScore gets stronger. For free. 📄 Paper: https://t.co/8CTwcPckL1 💻 Code: https://t.co/6ea4uwD5C1 Thanks to @chancharikm for driving this upgrade, and @gneubig and @RamananDeva.

0

54

17

23

57K

0

3

0

1

57
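
The scoring rule described in the quoted post (ask a VLM whether the image shows the prompt, then use P(Yes) as the score) can be illustrated with a small sketch. This is not the VQAScore library's API: the function names, the exact question wording, and the toy logits below are assumptions made for illustration, and no real VLM is called. It only shows how a "Yes" probability falls out of next-token logits via a softmax.

```python
import math

# Hypothetical question template, modeled on the wording quoted in the post.
QUESTION_TEMPLATE = 'Does this image show "{prompt}"?'


def build_question(prompt: str) -> str:
    """Turn a generation prompt into a yes/no question for a VLM."""
    return QUESTION_TEMPLATE.format(prompt=prompt)


def p_yes(next_token_logits: dict[str, float], yes_token: str = "Yes") -> float:
    """Softmax over the candidate answer tokens, returning P(yes_token).

    `next_token_logits` stands in for the logits a VLM would assign to its
    first answer token; here it is toy data (an assumption), not model output.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(next_token_logits.values())
    exps = {tok: math.exp(val - max_logit) for tok, val in next_token_logits.items()}
    return exps[yes_token] / sum(exps.values())


# Toy example: a model that leans towards "Yes".
question = build_question("a dog entering the background")
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
score = p_yes(toy_logits)
```

A higher `score` means the model is more confident that the image (or video) matches the prompt, which is why a stronger underlying VLM would be expected to give a better metric.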

Chancharik Mitra

about 2 months ago

At a high level, CHAI is about building a compositional language for video: not just modeling subject motion, but also subtle, high-impact factors like camera motion and shot composition that are often ignored. Explicitly modeling these leads to more faithful understanding and better control for generation. Excited about implications for video understanding, text-to-video, robotics, and 4D reasoning! Full details (and great examples) in the main thread + paper reposted above!

0

2

0

0

321

chancharikm retweeted

Khiem Vuong @kvuongdev

2 months ago

[1/7] Video diffusion has come a long way, generating more & more realistic videos. Can we revisit sparse-view novel view synthesis through these video priors? Meet FrameCrafter: a permutation-invariant multi-view model built on video diffusion 🧵 🌐 https://t.co/ogEN4mkE92

2

150

32

99

10K

Chancharik Mitra

2 months ago

They say a picture is worth a thousand words, which is precisely why the usefulness of a visual representation depends on the semantic context. Check out this exciting work on steering visual representations with language!

2 months ago

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

13

901

135

668

150K

1

3

0

1

277

chancharikm retweeted

2 months ago

Most of today's AI can see the world, but it doesn’t **feel** it. Capturing the sense of touch is crucial for dexterous robotic manipulation, user modeling, and understanding physical interactions. Introducing OpenTouch: bringing full-hand tactile sensing into real-world AI 🖐️
OpenTouch is collected in the wild using tactile sensing gloves, hand pose tracking gloves, and egocentric glasses. It includes:
• 5 hours of real-world data
• 3 hours of densely annotated contact-rich interactions
• 2,900 curated interaction clips
• 800 objects, 14 environments, and 29 grasp types
All open at: https://t.co/nONVdxLwXJ

10

222

37

124

47K

Chancharik Mitra

3 months ago

@Z1hanW @iclr_conf Awesome! Congratulations, Zihan!

0

1

0

0

84

Chancharik Mitra

4 months ago

Robotic reward modeling needs to consider failure cases and represent partial progress. This work takes an interesting approach to dealing with these challenges, among other valuable contributions. Definitely worth a read!

4 months ago

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

8

410

105

235

100K

0

17

3

2

2K

chancharikm retweeted

Min-Hung (Steve) Chen

4 months ago

Current Vision-Language Models completely struggle with complex 4D dynamics. We fixed that. 🤯 🚨 Introducing 4D-RGPT: distilling perceptual knowledge directly into LLMs for precise space & time reasoning. 🎉 Excited to share that our @NVIDIAAI work has been accepted to #CVPR2026! @CVPR A quick dive into how it works 🧵👇

2

80

16

39

14K

Chancharik Mitra

5 months ago

@kellerjordan0 ICL struggles with current instruction-tuned models. But finetuning sparse, task-specific model representations with few-shot examples might be promising! Maybe it's more about representations than learning algorithms! Our work might be worth checking out: https://t.co/04nk4l0VmY

0

0

0

1

92

Chancharik Mitra

6 months ago

That's a great question. Actually, the method is more of a finetuning approach than an architecture. It is inspired a lot by ideas in neuroscience. If we view the VLA as a brain that has learned useful things in pretraining, but we need to apply it to some specific downstream task, we wouldn't update the entire model, right (i.e., what LoRA and full FT do)? Instead, functional specificity in cog sci/neuroscience tells us that certain parts of the brain are useful for certain tasks. So our approach finetunes only the task-relevant parts of the model while freezing everything else. Excitingly, this is not only more efficient, but also more performant and generalizable than standard finetuning! Feel free to check out the paper! Though it is technical, the abstract, intro, and motivation should be accessible to a wider audience.

0

0

0

0

15
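
The idea in the reply above (update only the task-relevant parts of the model and freeze the rest) can be sketched in a few lines. This is a toy illustration, not the Robotic Steering method itself: the relevance scores, the top-k selection rule, and the single scalar standing in for each attention head's weights are all assumptions made for the example, since the thread does not specify how heads are chosen.

```python
# Each (layer, head) pair is represented by one scalar "weight" for simplicity.
Head = tuple[int, int]


def select_trainable_heads(relevance: dict[Head, float], k: int) -> set[Head]:
    """Pick the k heads with the highest (hypothetical) task-relevance score."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return set(ranked[:k])


def sparse_sgd_step(
    params: dict[Head, float],
    grads: dict[Head, float],
    trainable: set[Head],
    lr: float = 0.1,
) -> dict[Head, float]:
    """Apply an SGD update to the selected heads only; all others stay frozen."""
    return {
        head: (value - lr * grads[head]) if head in trainable else value
        for head, value in params.items()
    }


# Toy data (assumed): relevance scores for a 2-layer, 2-head model.
relevance = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.7, (1, 1): 0.2}
params = {head: 1.0 for head in relevance}
grads = {head: 0.5 for head in relevance}

trainable = select_trainable_heads(relevance, k=2)
updated = sparse_sgd_step(params, grads, trainable)
```

In this toy run only the two highest-scoring heads move, while the remaining parameters keep their pretrained values.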

Chancharik Mitra

6 months ago

@batsuev_es That's a neat idea. No we haven't done that. I think what you suggest is "sparsity" or "ranking" in a different sense from our approach, but it is definitely worth considering.

0

1

0

0

141

Chancharik Mitra

6 months ago

🎉 Despite massive pretraining, VLAs need to adapt to specific physical contexts. We introduce Robotic Steering, a novel finetuning method using mechanistic interpretability to surpass standard FT:
🎁 22× fewer parameters
🎁 +53% on unseen tasks
🎁 Interpretable
Thread below 👇

10

278

48

166

41K

Chancharik Mitra

6 months ago

@BDuisterhof Thanks, Bart!

0

1

0

0

47

chancharikm retweeted

6 months ago

Introducing CRISP, a real-to-sim pipeline that recovers human motion and simulatable scene geometry from monocular video! CRISP builds a contact-faithful 3D scene for simulation: 8× fewer sim failures, +43% faster sim, and improved human motion! Interactive demos 👉: https://t.co/locrdrxO16 Exciting collaboration w/ @JiashunWang @jefftan969 @_Tsukasane Jessica Hodgins @shubhtuls @RamananDeva

6

347

65

228

49K

Chancharik Mitra

6 months ago

@rafafelixphd And do stay tuned for more exciting updates like the code release, etc.

0

1

0

0

53

Chancharik Mitra

6 months ago

@rafafelixphd Yes, absolutely. Please do reach out if you have any questions!

1

1

0

0

199

Chancharik Mitra

6 months ago

I appreciate this perspective! In some sense, I agree the term "generalizable" intelligence is overloaded. As such, to be precise: what I am interested in with this work is actually enabling efficient, few-shot adaptation given just a few demos. This is the particular, specific sense of "generalization" we care about here. Now, I don't think there is any pivot necessarily. This is both a sample-efficient and parameter-efficient approach to finetuning that preserves more general pretrained capabilities compared to standard finetuning. In this way, Robotic Steering is not a tradeoff between these things, but rather a demonstration that a mechanistic approach to finetuning can have real benefits over LoRA and SFT. There's definitely more work to be done in this space, and I for one am excited to see the progress in the field!

0

0

0

0

177

Chancharik Mitra

6 months ago

📄 Paper: https://t.co/7oJYx3Zzuo 🌐 Project page: https://t.co/XNTdrI5eq9 Huge thanks to our amazing team across @CMU_Robotics, @berkeley_ai, and @USCViterbi 🙏! Yusen Luo @yusen_2001, Raj Saravanan, Dantong Niu @Dantong_Niu, Anirudh Pai, Jesse Thomason @_jessethomason_, Trevor Darrell @trevordarrell, Abrar Anwar @_abraranwar, Deva Ramanan @RamananDeva, Roei Herzig @roeiherzig

0

13

1

7

1K

Chancharik Mitra

6 months ago

🔍 Bonus: It's interpretable! Different tasks activate different attention heads. You can visualize exactly which circuits drive each behavior. Mechanistic understanding meets robotics 🤝 https://t.co/OmBqbEnkw7

1

6

0

1

732
