Phillip (Yuseung) Lee @yuseungleee - Twitter Profile

Pinned Tweet

about 1 year ago

❗️Vision-Language Models (VLMs) struggle with even basic perspective changes! ✏️ In our new preprint, we aim to extend the spatial reasoning capabilities of VLMs to ⭐️arbitrary⭐️ perspectives. 📄Paper: https://t.co/qq5s8jHtVN 🔗Project: https://t.co/sh5W8VLwZO 🧵[1/N]

4

149

35

78

22K

yuseungleee retweeted

Siyi Chen

@ChenSiyich

3 days ago

Wonderful to be back from #CVPR2026, and excited to share the release of our follow-up work:

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

VoLo introduces the idea of a physical orchestrator for open-vocabulary, long-horizon manipulation. Our goal is to move toward robots that can reason, plan, act, monitor, and recover by adaptively using VLA/WAMs, vision models, and action primitives as tools.

We introduce three main contributions:

🤖 VoLoAgent — a physical orchestrator that plans, monitors, and recovers by adaptively using, halting, and redirecting robot actions with tools.

📊 RoboVoLo — a high-fidelity benchmark with 126 open-vocabulary long-horizon manipulation tasks spanning common sense, memory/state tracking, complex references, and world knowledge.

📈 A large-scale empirical study comparing action models, code-as-policy systems, TAMP-style systems, and ablations of the VoLoAgent orchestrator, complemented by real-robot experiments.

This work was done during my internship at @NVIDIA and would not have been possible without my brilliant collaborators: Hugo Hadfield, Alexander Zook, @mikacuy, @luke_ch_song, @erwincoumans, @xuningy, Faisal Ladhak, @qu_1006, @BirchfieldStan, Jonathan Tremblay, and @robovalts. Huge thanks to everyone!

🔗 Project: https://t.co/Q2pEymou7U
🔗 Previous work, SpaceTools: https://t.co/xNLUjiNG4j

#Robotics #EmbodiedAI #VisionLanguageModels #VLAModels #RobotLearning #NVIDIA #CVPR2026 #LongHorizonManipulation #AI #ComputerVision

2

67

15

31

8K

Phillip (Yuseung) Lee @yuseungleee

9 days ago

MUSI workshop happening at Room 601 with a large audience! #CVPR2026

Minhyuk Sung @ CVPR @MinhyukSung

9 days ago

🚀MUSI @ CVPR 2026 starts in 30 minutes! Join us at Room 601 🌐https://t.co/8bAErFm7Fg

0

13

3

1

1K

0

8

1

0

702

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

9 days ago

🚀MUSI @ CVPR 2026 starts in 30 minutes! Join us at Room 601 🌐https://t.co/8bAErFm7Fg

0

13

3

1

1K

yuseungleee retweeted

Phillip (Yuseung) Lee @yuseungleee

10 days ago

#CVPR2026 @cvpr If you're interested in the intersection of multimodal and spatial intelligence, join our ✨MUSI workshop✨ on June 3 (Wed)! We’re bringing together an amazing lineup of speakers to discuss the latest and most exciting topics in multimodal spatial intelligence🧠

0

23

6

4

4K

Phillip (Yuseung) Lee @yuseungleee

10 days ago

#CVPR2026 @cvpr If you're interested in the intersection of multimodal and spatial intelligence, join our ✨MUSI workshop✨ on June 3 (Wed)! We’re bringing together an amazing lineup of speakers to discuss the latest and most exciting topics in multimodal spatial intelligence🧠

Minhyuk Sung @ CVPR @MinhyukSung

11 days ago

🚀 Join MUSI @ CVPR 2026! June 3 (Wed), 8:10–12:35, Room 601. Talk on spatial reasoning, world models, embodied AI & 3D. 🌐https://t.co/8bAErFm7Fg

1

19

3

13K

0

23

6

4

4K

yuseungleee retweeted

Mikaela Angelina Uy @mikacuy

10 days ago

Come and join us at MUSI in @CVPR 2026! See you all in Denver!

0

16

5

2

4K

yuseungleee retweeted

Roozbeh Mottaghi

@RoozbehMottaghi

11 days ago

I am giving a talk on future prediction and world modeling in 3D at this workshop #CVPR2026

0

13

1

0

2K

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

11 days ago

🚀 Join MUSI @ CVPR 2026! June 3 (Wed), 8:10–12:35, Room 601. Talk on spatial reasoning, world models, embodied AI & 3D. 🌐https://t.co/8bAErFm7Fg

1

19

3

13K

yuseungleee retweeted

Chan Hee (Luke) Song @CVPR2026

@luke_ch_song

14 days ago

Do VLMs actually understand 3D space 🌎? Or are they exploiting shortcuts hidden in natural images? 🚀 Excited to share our new work: Why Far Looks Up: Probing Spatial Representation in Vision-Language Models @NVIDIAAI × @SeoulNatlUni × @OhioStateCSE 🧵👇

6

229

45

145

19K

yuseungleee retweeted

Jaihoon Kim @KimJaihoon

16 days ago

(1/9) Should we fine-tune the diffusion model for reward alignment? 🤔 Not really. Instead, learn the twist function. We introduce Contrastive Distribution Matching to amortize the cost of inference scaling. 🚀 Website: https://t.co/Kq9xL4SZkH Paper: https://t.co/7a1DdK4OnY

1

13

5

7

2K

yuseungleee retweeted

Zifan Zhao

@Zifan_Zhao_2718

17 days ago

What is the most elegant way to give MLLMs spatial awareness? Instead of adding heavy 3D modules, we let the model learn a simple question: “Where am I, and where am I looking?” Introducing Cambrian-P, a new learning paradigm for video understanding. (1/n)

4

43

11

25

4K

yuseungleee retweeted

Jihwan Kim

@jji_hwannn

24 days ago

Still struggling with frame scaling in Video LLMs? 🤯 💫 Introducing LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs, our work done at @GoogleDeepMind. TL;DR: We propose LiteFrame, a highly efficient video encoder for Video Large Language Models to resolve inefficiencies in both the LLM and the ViT. [1/n]

1

114

17

109

17K

yuseungleee retweeted

Siddharth Joshi

@sjoshi804

30 days ago

Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in chasing paper deadlines and applying for fellowships / grants, but research in the truest sense of the word - sitting with unsolved, sometimes previously unheard-of problems, contextualizing them, formulating them, exploring solutions to them.

I'd had a taste of research in college, flitting between disciplines, but never found something I felt truly passionate about until I came across deep learning. A field mixing empiricism, mathematics, and real-world impact all seamlessly - it made research the most exciting thing I'd ever done in my life. So in 2022 I started my PhD hoping for the chance to explore uncharted frontiers. Three years and several papers at the standard prestigious ML conferences later, I had technically done research. But I still didn't feel like I'd ever had the freedom, support, and resources to explore new and exciting ideas.

This is what brought me to Datology as an intern last summer. A hope to do research in the true sense - explore new ideas, supported by my peers and leaders, unconstrained by resources. And of course, about the data. At the end of the summer, I took a risk and stayed, putting my PhD on hold.

Since then, I've been lucky enough to grow into leading multimodal data curation at DatologyAI, and with our team we've tackled every challenge possible: the engineering and optimizing of a VLM training stack we built from scratch; the at-times frustrating but ultimately rewarding deep refining of VLM evals in our work DatBench (link); and of course a lot of exhilarating new research on DATA CURATION. But more than anything, I felt like I finally got to do research!!

I'd like to specifically thank @arimorcos and @leavittron who entrusted me with this opportunity, empowered me to do the best work of my life (so far), and mentored me to grow not only as a researcher but also as a leader. And a huge thanks to the @datologyai team that made research feel FUN again.

Today, we're releasing 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. This is the culmination of the multimodal team at Datology's work over the past year.

At fixed architecture, recipe, and compute, varying only the pretraining data, we get +11.7pp at 2B across 20 public VLM benchmarks, beat InternVL3.5-2B by ~10pp at ~17x less training compute (without post-training), and hit near-frontier accuracy at 4B with 3.3x lower response FLOPs than Qwen3-VL-4B.

Take risks. Bet on yourself. I’m going to keep doing this. At least until my luck runs out :)

a 🧵

10

334

34

144

791K

yuseungleee retweeted

Eliya Habba @EliyaHabba

about 1 month ago

New datasets keep coming, New models keep coming. Frustrating! How can we evaluate everything on everything? How do we keep scores comparable over time? We propose a way to grow benchmark suites without losing comparability. Details:👇🧵

3

40

12

5

2K

yuseungleee retweeted

Jiawei Yang

@JiaweiYang118

about 1 month ago

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.

56

954

157

627

229K

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

about 2 months ago

#ICLR2026 [2/2] This afternoon (Apr 24), we’ll present 𝗕é𝘇𝗶𝗲𝗿𝗙𝗹𝗼𝘄, enabling improved few-step generation in diffusion/flow models with just 15 mins of optimizing stochastic interpolant coefficients. 📅 Fri Apr 24 Afternoon, 𝗣𝟯-#𝟳𝟭𝟴 🌐 Web: https://t.co/ywAyn1enoD

0

38

4

16

2K

yuseungleee retweeted

Songyou Peng @songyoupeng

about 2 months ago

Yay, finally! Introducing Vision Banana🍌 from @GoogleDeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: https://t.co/GQgRi6mWwC (1/5)

56

2K

310

1K

283K

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

about 2 months ago

#ICLR2026 [1/2] On Friday morning, check out 𝗣𝗮𝗶𝗿𝗙𝗹𝗼𝘄, which enables higher-quality few-step generation in diffusion/flow-based models with only 0.2%–1.7% of the original training cost. 📅 𝗙𝗿𝗶 𝗔𝗽𝗿 𝟮𝟰 𝗠𝗼𝗿𝗻𝗶𝗻𝗴, 𝗣𝟯-#𝟭𝟴𝟬𝟰 🌐 Web: https://t.co/JQmpSN10m5

1

39

3

24

2K

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

about 2 months ago

#ICLR2026 🇧🇷 Excited that our group will be presenting two main papers and two workshop papers at ICLR 2026! Please come check out our posters.

1

12

3

0

716

yuseungleee retweeted

Minhyuk Sung @ CVPR @MinhyukSung

about 2 months ago

#ICLR2026 🇧🇷 Excited to present two papers from our group: BézierFlow & PairFlow, cutting diffusion training/fine-tuning from days to minutes. 𝗕é𝘇𝗶𝗲𝗿𝗙𝗹𝗼𝘄: https://t.co/7Iifiw7b19 𝗣𝗮𝗶𝗿𝗙𝗹𝗼𝘄: https://t.co/JQmpSN10m5

1

137

21

83

10K
