🚨 Excited to share VisionCoach, an RL framework for reinforcing grounded video reasoning via visual-perception prompting and self-distillation!
🧠 Video reasoning models often miss where to look or rely on language priors. Instead of only supervising final answers, we encourage the model to learn to attend to the right visual evidence.
⚽️ VisionCoach uses RL to reward correct visual attention, with dynamic visual prompting as a training-time coach for better spatio-temporal grounding, while keeping inference simple and tool-free via self-distillation.
⭐️ Achieves state-of-the-art zero-shot performance across video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA).
👇🧵
👋 Looking forward to attending #ACL2026 (in-person in San Diego) and #ICML2026 (probably virtually) for these presentations/workshop keynotes & meeting everyone (also, I'll be in the Bay Area beforehand, for a keynote at the Apple Reasoning and Planning Workshop)!
Feel free to ping me if you want to meet up in the Bay Area/SD (I also have July 1 partly free in SJ/SF, and several days in SD) and discuss research, life, etc. (we're also hiring at all levels: PhD, postdoc, faculty)! 🙂
PS. also meet several of our awesome students/postdocs/alumni attending these 2 conferences to present these works.
👇👇
Thrilled that VisionCoach, a collaborative work, is provisionally accepted to #ECCV2026! 🇸🇪💙
VisionCoach is an RL framework for reinforcing grounded video reasoning through visual-perception prompting and self-distillation. It uses RL to reward correct visual attention, with dynamic visual prompting serving as a training-time coach to improve spatio-temporal grounding, and achieves SOTA performance across video reasoning/understanding/temporal grounding benchmarks.
🚨We updated Adaptive Visual Imagination Control (AVIC) with a new RL extension: AVIC-R.
In AVIC, we proposed adaptive visual imagination for spatial reasoning and found that world models are most useful when invoked selectively.
In AVIC-R, we take this one step further by training a policy with RL to decide when to imagine and how much to imagine. Using GRPO with QA correctness and imagination cost as rewards, AVIC-R learns to use world models more efficiently and improves spatial reasoning without relying on fixed test-time scaling.
Notably, a Qwen2.5-VL-7B policy trained with AVIC-R outperforms GPT-4o / GPT-4 as policy models, showing that RL can teach smaller models to make better adaptive imagination decisions.
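To make the reward idea concrete, here is a minimal sketch of a scalar reward that trades off QA correctness against imagination cost. The thread does not give the exact formula, so the linear penalty, the `cost_weight` value, and the use of world-model call count as the cost unit are all assumptions for illustration, not AVIC-R's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool           # did the final answer match the reference?
    imagination_calls: int  # hypothetical cost unit: number of world-model calls


def reward(rollout: Rollout, cost_weight: float = 0.1) -> float:
    """QA correctness minus a linear imagination-cost penalty (assumed form)."""
    return float(rollout.correct) - cost_weight * rollout.imagination_calls


# A group of rollouts for the same question, as a GRPO-style trainer would sample.
group = [
    Rollout(correct=True, imagination_calls=0),
    Rollout(correct=True, imagination_calls=3),
    Rollout(correct=False, imagination_calls=0),
    Rollout(correct=False, imagination_calls=5),
]
rewards = [reward(r) for r in group]
```

Under this assumed reward, a correct answer with no imagination scores highest, and a wrong answer after heavy imagination scores lowest, which is the kind of pressure that would push a policy toward invoking the world model selectively.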
See more details in arXiv v2 👉 https://t.co/kn1wdis3sf
(original thread below 👇)
🎉Excited to see VisionCoach accepted by #ECCV2026!
Adaptive visual prompting is key to scalable and effective video reasoning RL. Rather than applying the same perception strategy to every example, VisionCoach learns when and what extra visual guidance is needed, uses it to augment perception during RL, and internalizes this behavior for prompt-free raw-video inference.
👇
Thrilled that AnchorWeave is accepted to #ECCV2026!🎉
AnchorWeave tackles long-horizon memory for world-consistency modeling: instead of maintaining one noisy global 3D memory, it retrieves multiple local spatial 3D memories, avoiding multi-view geometric misalignment and learning to reconcile them during generation, which substantially improves world consistency while preserving visual quality.
Updated paper + code coming soon, with new 2000+ frame stress tests, runtime/memory analysis, and dynamic-scene examples, etc.
Details 👇
I'll be at #CVPR2026, feel free to ping if you want to meet up! Will be giving 4 different keynotes at these exciting @CVPR workshops and looking forward to engaging discussions on diverse topics 🙂
(also happy to discuss hiring at all levels: PhD, postdoc, faculty)
ps. also meet several of our awesome students/postdocs who will be attending
🚨 Excited to introduce PhyMotion🤸: Structured 3D Motion Reward for Physics-Grounded Human Video Generation!
❌ Existing 2D video rewards misleadingly assign high scores to videos with floating feet, self-penetrating limbs, and physics-violating motions.
✅ PhyMotion lifts generated videos into 3D, grounds them in a physics simulator, and scores motion along kinematic / contact / dynamic feasibility.
➡️ RL post-training with PhyMotion improves a 1.3B model to match 14B models' performance in human preference.
🧵(1/n)👇
🚨 Excited to share EgoMemReason, a benchmark for multi-level memory-driven reasoning (entity, event, and behavior memory) over week-long egocentric videos (average 25.9 hours of temporal backtracking)!
📉 Current long-video approaches can retrieve isolated events, but struggle with long-horizon memory that requires retrieving and understanding information across multiple events and long time spans: tracking evolving entities across days, linking temporally distant events, and abstracting recurring behavior patterns from long observations.
🎥 EgoMemReason evaluates these challenges through 500 human-verified questions spanning entity, event, and behavior memory, requiring aggregation over an average of 5.1 evidence segments and 25.9 hours of temporal backtracking.
⭐️ Across 17 models/frameworks, even the best model achieves only 39.6% accuracy, revealing that long-horizon multimodal memory remains far from solved.
Looking forward to giving a keynote at the Midwest Machine Learning Symposium (MMLS) 2026 (being held at Purdue University this year) & meeting folks from all the strong universities in the midwest, with their inspiring, long tradition of these exciting symposiums! 🙂
👇👇
🎉 Excited to share that EPiC is accepted to #ICML2026!
We show that learning precise camera control for video diffusion doesn't need expensive 3D supervision or large-scale data. No camera or point cloud processing — just mask source videos based on visibility to construct precise training anchor videos, and learn a SoTA camera controller with only 30M params, trained >100× faster on >100× less data than prior work, while generalizing across both I2V and V2V camera control tasks.
🚨Cog-DRIFT: Breaking the Exploration Barrier in RLVR
RLVR has pushed LLM reasoning forward BUT hits a ceiling: if a model can't solve a problem (rollouts never succeed), it gets 0 learning signal 👉Hard problems stay unsolved, and training stalls.
We introduce✨Cog-DRIFT✨to reformulate hard problems into "cognitively" easier, structured variants (MCQ and cloze), then curriculum-train models from easy → hard to unlock new learning signals.
Key takeaways:
1⃣ Breaks the ceiling on "unsolvable" hard problems from 0% → 10.11% (Qwen) & 0% → 8.64% (Llama) in absolute gains
2⃣ Consistent gains across 6 benchmarks & 2 models → +4.72% (Qwen), +3.23% (Llama) over strong baselines
3⃣ Reformulations span discriminative → generative formats, enabling effective knowledge transfer back to hard open-ended reasoning
4⃣ Adaptive curriculum matters: training progresses from easier variants to harder ones, leading to continued improvement and improved sample efficiency
5⃣ Also boosts test-time performance (pass@k), showing that the model acquires new reasoning patterns from hard problems that were out of its reach w/o Cog-DRIFT
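The "0 learning signal" problem described above can be seen in a few lines. This is a sketch assuming a GRPO-style group-normalized advantage; the thread does not say which RLVR algorithm Cog-DRIFT uses, and `group_advantages` and `eps` are illustrative names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard problem: every rollout fails, so every advantage is exactly zero.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# Easier variant (e.g., an MCQ form): one rollout succeeds, so signal appears.
mixed = group_advantages([0.0, 1.0, 0.0, 0.0])
```

With all-zero advantages the policy gradient for that problem vanishes, which is why an easier reformulation that lets even one rollout succeed can restart learning.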
🧵👇
🥳 I am incredibly honored and grateful to receive the 2026 @UNC Distinguished Dissertation Award!
This award recognizes four recipients across the whole university, and I’m humbled to represent the Mathematics, Physical Sciences, and Engineering category this year.
Many thanks to my advisor @mohitban47, our MURGe-Lab family, and the @unccs @unc_ai_group for their constant support! 🙏
This is a great reminder of all the good memories from my PhD journey before I start my faculty career at The Johns Hopkins University 😊
🔥 Check out my close coworker @shoubin621’s interesting new egocentric benchmark! I believe this could be a groundbreaking step toward AR glasses applications.
🔎 I've always wondered about the exact name or brand of a product I encounter in the real world.
Ego2Web is a benchmark for real-world grounded web agents that must understand egocentric video and use that context to perform actions on the web. Both interesting and highly practical! ⚡️
Introducing Ego2Web from Google DeepMind and UNC Chapel Hill, accepted to #CVPR2026.
AI agents can browse the web. But can they act based on what you see? Existing benchmarks focus only on web interaction while ignoring the real world.
Ego2Web bridges egocentric video perception and web execution, enabling agents that can see through first-person video, understand real-world context, and take actions on the web grounded in the egocentric video.
This opens a path toward AI assistants that operate seamlessly across physical and digital environments. We hope Ego2Web serves as an important step for building more capable, perception-driven agents.
🧵👇
✨ Excited to share V-Co, a systematic study of, and recipe for, visual co-denoising in pixel-space diffusion.
Instead of loosely injecting pretrained features, we study how pixels and semantics should be jointly denoised and properly aligned.
- Dual-stream design for clean interaction
- Structural masking for stronger CFG
- Hybrid loss for richer supervision
- Simple scaling for stable training
A simple yet principled recipe, delivering strong and consistent gains.
Details👇
🚀 Excited to share V-Co, a diffusion model that jointly denoises pixels and pretrained semantic features (e.g., DINO).
We find a simple but effective recipe:
1️⃣ architecture matters a lot --> fully dual-stream JiT
2️⃣ CFG needs a better unconditional branch --> semantic-to-pixel masking for CFG
3️⃣ the best semantic supervision is hybrid --> perceptual-drifting hybrid loss
4️⃣ calibration is essential --> RMS-based feature rescaling
We conducted a systematic study on V-Co, which is highly competitive at a comparable scale, and outperforms JiT-G/16 (~2B, FID 1.82) with fewer training epochs.
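As a rough illustration of the calibration point, here is one plausible reading of "RMS-based feature rescaling": scale the semantic features so their root-mean-square matches a target value (for example, that of the pixel stream). The thread does not give the formula, so `rescale_to_rms`, `target_rms`, and the per-vector scaling are assumptions, not V-Co's actual implementation.

```python
import math


def rms(values):
    """Root-mean-square magnitude of a flat list of feature values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rescale_to_rms(features, target_rms=1.0, eps=1e-8):
    """Scale features so their RMS matches target_rms (assumed calibration rule)."""
    scale = target_rms / (rms(features) + eps)
    return [v * scale for v in features]


# Toy semantic features with a much larger magnitude than unit-scale pixels.
semantic = [12.0, -9.0, 20.0, -16.0]
rescaled = rescale_to_rms(semantic, target_rms=1.0)
```

The rescaled features keep their direction but land on the same magnitude as the target stream, so neither stream dominates the joint denoising loss purely because of scale.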
🧵 👇
🚨Excited to share our work on VisionCoach!
- Video reasoning isn’t failing because models can’t reason; it’s failing because they don’t see correctly.
- Instead of adding more tools at inference, we teach models how to look during training.
✨VisionCoach = visual prompting (train-time) + RL with self-distillation -> grounded reasoning with tool-free inference
Check out our new work ⚽️VisionCoach, an RL + self-distillation framework for complex video reasoning.
We combine reinforcement learning with dynamic visual prompting, where a visual prompt selector adaptively augments hard training examples based on reward signals.
Visual grounding is key to accurate video reasoning. Instead of adding complexity at inference, we use visual prompting during training to guide models toward better spatio-temporal attention—then distill this capability into a simple, single-path model.
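The idea of augmenting hard examples based on reward signals can be sketched with a simple rule. Note the assumptions: the actual selector in VisionCoach may well be learned and choose among prompt types, whereas this toy version just flags examples whose mean rollout reward falls below a hypothetical `threshold`.

```python
def select_hard_examples(reward_by_example, threshold=0.5):
    """Return ids of examples whose mean rollout reward is below the threshold."""
    hard = []
    for example_id, rewards in reward_by_example.items():
        if sum(rewards) / len(rewards) < threshold:
            hard.append(example_id)
    return hard


# Hypothetical per-example rollout rewards from one RL step.
batch = {
    "vid_a": [1.0, 1.0, 0.0, 1.0],  # mostly solved: train on the raw video
    "vid_b": [0.0, 0.0, 1.0, 0.0],  # mostly failed: candidate for visual prompting
}
hard_ids = select_hard_examples(batch)
```

Only the flagged examples would receive extra visual prompts during training, while inference still runs on raw video with no prompting.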
Awesome collaboration with @shoubin621 @zhan1624 @mohitban47 @unc_ai_group @unccs
Check the full paper for more details!
- ArXiv: https://t.co/DKMv1T31zf
- Code: https://t.co/sqMWgsqQPb
- Webpage: https://t.co/EPb8oqHZdv
- @huggingface page: https://t.co/99YikNSH0i
- @huggingface model: https://t.co/dbtd2SYOTy
✏️ Analysis: Spatio-temporal Attention
- Visual prompting improves spatio-temporal grounding by increasing attention on the correct key frame and focusing on the relevant spatial region.
- It highlights key visual attributes (e.g., the cowboy’s clothing) while suppressing irrelevant regions.
Please refer to the demo video for more qualitative examples!