[1/6] Recent models like DUSt3R generalize well across viewpoints, but performance drops on aerial-ground pairs.
At #CVPR2025, we propose AerialMegaDepth (https://t.co/tDGMVXAFa7), a hybrid dataset combining mesh renderings with real ground images (MegaDepth) to bridge this gap.
@CVPR @qiwang067 Hi @CVPR, just wanted to follow up on this. Do you know if the certificate email will be sent out soon? Such documentation would be very helpful for our records and various other purposes. Thank you so much!
[1/8] Video models generate stunning motion. But can you tell them how bouncy, slippery or soft something should be?
PhyCo (CVPR2026): the same scene under different friction, restitution, deformation, or force — specified as input, not left to chance. 🧵
https://t.co/om2skODAYc
Awesome work from @Jimantha and co., as always! One aspect of our AerialMD work that we’ve always felt was underrated is its potential to help “metricize” everything through geotagged image registration. It’s great to see that vision being pushed further and executed nicely here. Congrats on the nice work @ambie_kk!
Honey, I Shrunk the Arc de Triomphe! 😱
Ever notice how SOTA depth models suffer from "scale-collapse"—metrically shrinking distant landmarks like they're toys? We introduce MetricScenes: a new in-the-wild metric dataset that fixes this!
Eventful @CVPR 2026 coming up! Presenting some of our latest research on scaling 3D, 4D & World Models 🚨
My talk at the Image Matching 2026 Workshop June 4th Room 504 1:45 pm LT -
Scaling Representation Learning for Correspondence to Spatial Intelligence! Join for 🌶️ takes
@PeterHedman3 @RamananDeva talks at the ScanNet++ Workshop on View Synthesis & 3D Worlds - June 3rd R 710 3:40 pm LT
Peter Kontschieder presenting World Modeling research (including stuff from @ethanjohnweber & team) - June 4th R 607 8 am LT, June 4th R 203 2:30 pm LT
@JayKarhade @CMU_Robotics presenting Any4D - June 6th Poster Session 3 ExHall F 11:45 am LT, 4D Vision & 4D World Models Workshop Orals: June 4th R 506 4:30 pm LT, June 4th R 203 5 pm LT
Lastly @OmarAlama @AviBh11 presenting our @AirLabCMU semantic scene understanding research - Findings (June 7th 7:30 am LT ExHall A) & OpenSUN3D Workshop (June 3rd afternoon)
Sadly, my first in-person CV conference will have to wait 🥲 but... do attend for a sneak peek at what we are cooking! 👀🧵👇
Presenting two posters at #CVPR this week on vision with light and heat 👁️📷💡🔥
Thermal for Image Intrinsics
🔗 https://t.co/ezUhhibm9v
📍 Sun 5:30 – 7:30 PM, ExHall A 518
📍 Wed 11:15 - 12:50 PM, Mile High 4CD
Revealing Heat Flows
📍 Wed 11:15 - 12:50 PM, Mile High 4CD
Before AI can generate professional videos, it needs to see like a professional.
We spent a year with 100+ content creators teaching AI to describe video like a filmmaker would.
Introducing CHAI: Critique-based Human-AI Oversight for Building a Precise Video Language [CVPR'26 Highlight, Top 3%].
Try prompting a video generator for a dolly zoom, dutch angle, point of view, or camera roll. Most fall back to the same bland defaults: a push-in, a level shot, a third-person view. Why? These techniques require a language of cinema that current models rarely speak.
We built that language:
1️⃣ Precise specification: 5-aspect structured captions co-designed with professional cinematographers covering subject, scene, motion, spatial, and camera dynamics
2️⃣ Scalable oversight: LLMs draft captions, humans critique what's wrong and how to fix it
3️⃣ Post-training recipes: Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5
4️⃣ Video generation: fine-tuned Wan follows 400-word cinematic prompts with precise control
Here's how each works 🧵
Work led by CMU and Harvard with @chancharikm, @du_yilun, and @RamananDeva.
📄 Paper: https://t.co/wCwEtvrntM
🌐 Site: https://t.co/oAAQklGrfF
@songyoupeng @GoogleDeepMind Great work @songyoupeng! Results on in-the-wild examples look amazing! I’m curious about the evaluation -- did you check for potential data leakage (e.g., whether the base model might have seen any of the evaluation data during pretraining)?
🚀 Excited to announce Vision Banana 🍌 and our new paper: “Image Generators are Generalist Vision Learners”. We turn Nano Banana Pro into a state-of-the-art visual generation and understanding model.
🖼️ Check out our gallery at https://t.co/CEQJXroPaE
🧵 (1/N) continue ⬇️
Hey, great question! In our experiments, we did find that zeroing out the temporal component of the original 3D RoPE makes training slower to converge, but it eventually reaches better performance.
On PRoPE specifically: due to compute constraints, we weren’t able to fully finetune the entire Wan2.1-14B and were limited to LoRA. Our hypothesis is that PRoPE likely benefits much more from full SFT and longer training, since it effectively modifies the attention behavior. With LoRA alone, the model’s capacity to adapt to that change is somewhat constrained, so it probably wasn't enough.
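For anyone wondering what "zeroing out the temporal component" of a 3D RoPE can look like, here is a minimal NumPy sketch. It assumes a factorized RoPE where each head's channels are split into time / height / width groups; the function names, the `dims` split, and the `zero_time` flag are all hypothetical and not taken from the FrameCrafter or Wan2.1 code.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for 1D positions; shape (len(pos), dim // 2)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w, dims, zero_time=False):
    """Factorized 3D RoPE on tokens x of shape (N, D).

    t, h, w: integer coordinates per token, each of shape (N,).
    dims: (d_t, d_h, d_w), each even, summing to D (an assumed split).
    zero_time: hypothetical flag; treats every token as frame 0, so the
    time channels get an identity rotation and frame order is ignored.
    """
    if zero_time:
        t = np.zeros_like(t)
    d_t, d_h, d_w = dims
    angles = np.concatenate(
        [rope_angles(t, d_t), rope_angles(h, d_h), rope_angles(w, d_w)],
        axis=-1,
    )
    return apply_rope(x, angles)
```

With `zero_time=True`, the encoded queries and keys no longer depend on the frame index, so attention scores cannot distinguish one input ordering from another; only the spatial (height / width) rotation remains.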
[1/7] Video diffusion has come a long way, generating more & more realistic videos.
Can we revisit sparse-view novel view synthesis through these video priors?
Meet FrameCrafter: a permutation-invariant multi-view model built on video diffusion 🧵
🌐 https://t.co/ogEN4mkE92
On a personal note -- this was my first time taking on more of a "mentoring" role as a senior PhD student, and it's been incredibly rewarding. All credit goes to @qi_wu57 and my amazing collaborators!
Also, stay tuned -- more exciting work coming soon! 😉
[7/7] Takeaway: video models already carry strong multi-view priors that are surprisingly easy to unlock, and it's easy to make them “forget” time.
📄 https://t.co/ogEN4mkE92 (code released)
Led by @qi_wu57, w/ @Minsik_Je0n, Srinivasa Narasimhan, @RamananDeva at @CMU_Robotics.
I made a Claude Code skill that generates conference posters 🛠️
Instead of a static PDF, it outputs a single HTML file — drag to resize columns, swap sections, adjust fonts, then give your layout back to Claude. 🔁
🔗 Skill 👉 https://t.co/KhYV8anbxL