Khiem Vuong @kvuongdev - Twitter Profile

Pinned Tweet

about 1 year ago

[1/6] Recent models like DUSt3R generalize well across viewpoints, but performance drops on aerial-ground pairs. At #CVPR2025, we propose AerialMegaDepth (https://t.co/tDGMVXAFa7), a hybrid dataset combining mesh renderings with real ground images (MegaDepth) to bridge this gap.

7

552

103

330

56K

Khiem Vuong @kvuongdev

1 day ago

@ethanjohnweber Excited to see what you build next! Hope to catch up with you soon (will be in the Bay sometime next month)!

0

1

0

79

Khiem Vuong @kvuongdev

9 days ago

@CVPR @qiwang067 Hi @CVPR, just wanted to follow up on this. Do you know if this certificate email will be sent out soon? Such documentation would be very helpful for the records and various purposes, thank you so much!

0

16

kvuongdev retweeted

Sriram Narayanan @nnsriram97

12 days ago

[1/8] Video models generate stunning motion. But can you tell them how bouncy, slippery or soft something should be? PhyCo (CVPR2026): the same scene under different friction, restitution, deformation, or force — specified as input, not left to chance. 🧵 https://t.co/om2skODAYc

2

60

12

27

9K

Khiem Vuong @kvuongdev

18 days ago

@Jimantha Thanks Noah!

0

1

0

97

Khiem Vuong @kvuongdev

18 days ago

Awesome work from @Jimantha and co., as always! One aspect of our AerialMD work that we’ve always felt was underrated is its potential to help “metricize” everything through geotagged image registration. It’s great to see that vision being pushed further and executed nicely here. Congrats on the nice work @ambie_kk!

Yuanbo Xiangli @ambie_kk

18 days ago

Honey, I Shrunk the Arc de Triomphe! 😱 Ever notice how SOTA depth models suffer from "scale-collapse"—metrically shrinking distant landmarks like they're toys? We introduce MetricScenes: a new in-the-wild metric dataset that fixes this!

2

149

17

74

22K

1

5

0

1

2K

kvuongdev retweeted

Nikhil Keetha

@Nik__V__

21 days ago

Eventful @CVPR 2026 coming up! Presenting some of our latest research on scaling 3D, 4D & World Models 🚨
My talk at the Image Matching 2026 Workshop June 4th Room 504 1:45 pm LT - Scaling Representation Learning for Correspondence to Spatial Intelligence! Join for 🌶️ takes
@PeterHedman3 @RamananDeva talks at the ScanNet++ Workshop on View Synthesis & 3D Worlds - June 3rd R 710 3:40 pm LT
Peter Kontschieder presenting World Modeling research (including stuff from @ethanjohnweber & team) - June 4th R 607 8 am LT, June 4th R 203 2:30 pm LT
@JayKarhade @CMU_Robotics presenting Any4D - June 6th Poster Session 3 ExHall F 11:45 am LT, 4D Vision & 4D World Models Workshop Orals: June 4th R 506 4:30 pm LT, June 4th R 203 5 pm LT
Lastly @OmarAlama @AviBh11 presenting our @AirLabCMU semantic scene understanding research - Findings (June 7th 7:30 am LT ExHall A) & OpenSUN3D Workshop (June 3rd afternoon)
Sadly my first in person CV conference will have to wait 🥲 but.. do attend for a sneak peek on what we are cooking! 👀🧵👇

4

100

16

50

18K

kvuongdev retweeted

Leo / Zeqing Yuan @Leo_ZQ_Yuan

21 days ago

Presenting two posters at #CVPR this week on vision with light and heat 👁️📷💡🔥
Thermal for Image Intrinsics
🔗 https://t.co/ezUhhibm9v
📍 Sun 5:30 – 7:30 PM, ExHall A 518
📍 Wed 11:15 - 12:50 PM, Mile High 4CD
Revealing Heat Flows
📍 Wed 11:15 - 12:50 PM, Mile High 4CD

0

7

1

495

kvuongdev retweeted

Zhiqiu Lin

@ZhiqiuLin

about 2 months ago

Before AI can generate professional videos, it needs to see like a professional. We spent a year with 100+ content creators teaching AI to describe video like a filmmaker would.
Introducing CHAI: Critique-based Human-AI Oversight for Building a Precise Video Language [CVPR'26 Highlight, Top 3%].
Try prompting a video generator for a dolly zoom, dutch angle, point of view, or camera roll. Most fall back to the same bland defaults: a push-in, a level shot, a third-person view. Why? These techniques require a language of cinema that current models rarely speak. We built that language:
1️⃣ Precise specification: 5-aspect structured captions co-designed with professional cinematographers covering subject, scene, motion, spatial, and camera dynamics
2️⃣ Scalable oversight: LLMs draft captions, humans critique what's wrong and how to fix it
3️⃣ Post-training recipes: Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5
4️⃣ Video generation: fine-tuned Wan follows 400-word cinematic prompts with precise control
Here's how each works 🧵
Work led by CMU and Harvard with @chancharikm, @du_yilun, and @RamananDeva.
📄 Paper: https://t.co/wCwEtvrntM
🌐 Site: https://t.co/oAAQklGrfF

25

372

63

494

35K

Khiem Vuong @kvuongdev

2 months ago

@songyoupeng @GoogleDeepMind Great work @songyoupeng! Results on in-the-wild examples look amazing! I’m curious about the evaluation -- did you check for potential data leakage (e.g., whether the base model might have seen any of the evaluation data during pretraining)?

0

1

0

572

kvuongdev retweeted

Shangbang Long @ShangbangLong

2 months ago

🚀 Excited to announce Vision Banana 🍌 and our new paper: “Image Generators are Generalist Vision Learners”. We turn Nano Banana Pro into a state-of-the-art visual generation and understanding model. 🖼️ Check out our gallery at https://t.co/CEQJXroPaE 🧵 (1/N) continue ⬇️

22

432

70

263

61K

Khiem Vuong @kvuongdev

2 months ago

Hey, great question! In our experiments, we did find that zeroing out the temporal component of the original 3D RoPE makes training slower to converge, but it eventually reaches better performance. On PRoPE specifically: due to compute constraints, we weren’t able to fully finetune the entire Wan2.1-14B and were limited to LoRA. Our hypothesis is that PRoPE likely benefits much more from full SFT and longer training, since it effectively modifies the attention behavior. With LoRA alone, the model’s capacity to adapt to that change is somewhat constrained, so it probably wasn't enough.

0

2

0

1

110

Khiem Vuong @kvuongdev

2 months ago

[1/7] Video diffusion has come a long way, generating more & more realistic videos. Can we revisit sparse-view novel view synthesis through these video priors? Meet FrameCrafter: a permutation-invariant multi-view model built on video diffusion 🧵 🌐 https://t.co/ogEN4mkE92

2

150

32

99

10K

Khiem Vuong @kvuongdev

2 months ago

On a personal note -- this was my first time taking on more of a "mentoring" role as a senior PhD student, and it's been incredibly rewarding. All credit goes to @qi_wu57 and my amazing collaborators! Also, stay tuned -- more exciting work coming soon! 😉

0

6

0

272

Khiem Vuong @kvuongdev

2 months ago

[7/7] Takeaway: video models already carry strong multi-view priors that are surprisingly easy to unlock, and it's easy to make them “forget” time. 📄 https://t.co/ogEN4mkE92 (code released) Led by @qi_wu57, w/ @Minsik_Je0n, Srinivasa Narasimhan, @RamananDeva at @CMU_Robotics.

1

4

0

1

297

kvuongdev retweeted

Ethan Weber @ethanjohnweber

3 months ago

I made a Claude Code skill that generates conference posters 🛠️ Instead of a static PDF, it outputs a single HTML file — drag to resize columns, swap sections, adjust fonts, then give your layout back to Claude. 🔁 🔗 Skill 👉 https://t.co/KhYV8anbxL

30

2K

330

3K

186K
