Michael Ryoo @ryoo_michael - Twitter Profile

4 months ago

LVNet accepted to #EACL26! Training-free keyframe selector for long-video QA: 🎯 High accuracy low caption, ⚡ up to 3.4x speed, ⚙️ filters 1,800 to 24 keyframes on 1 GPU, 💸 10x cheaper LLM cost. Paper: https://t.co/wvrtJ2jmSc More details in the thread. ⬇️ Demo: LVNet (top)

ryoo_michael retweeted

Ziyang Wang @ZiyangW00

6 months ago

🚨 Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding 🚨
Introducing Active Video Perception: an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence.
🎬 Key Highlights:
🧠 Human-Inspired Active Perception: AVP mimics how humans watch video by first skimming for global context, then focusing on a few critical moments. It treats the video as an interactive environment.
🔄 Iterative Evidence Seeking: AVP runs a Plan–Observe–Reflect loop, dynamically querying video parts for fine-grained evidence and continually assessing whether it has enough information or needs to look deeper.
🚀 Efficiency Breakthrough: High accuracy meets low cost. AVP outperforms the best agentic approach by +5.7% accuracy while using just 12.4% of the tokens and 18.4% of the inference time.
How does AVP transform passive video processing into active, agentic exploration? Dive into the details below! 🧵

Michael Ryoo @ryoo_michael

9 months ago

Strefer auto-generates instruction data for tuning Video LLMs on space-time–focused video tasks. With just +545 short videos, Strefer-trained models outperform baselines on various tasks, showing stronger space–time–aware perception & reasoning !!

Salesforce AI Research @SFResearch

9 months ago

(Thread 1/8) 🚨 Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data 🚨
Introducing Strefer: a novel data engine for auto-generating instruction data that enables Video LLMs to excel at spatiotemporal video understanding 🎬🧩⏳
Key Contributions:
▶️ Automated Pipeline: Eliminates dependence on legacy annotations through fully automatic instruction generation
▶️ Fine-grained Spatiotemporal Information: Produces temporally aligned, object-centric metadata with instruction-response pairs and multimodal prompts
▶️ Data-Efficient: Achieves improvements in space-time referring and reasoning with only 545 extra videos and no proprietary model dependencies
📄 Paper: https://t.co/d2mn1PtTkl
🌐 Project: https://t.co/LkPbRtyK0o
💻 Code: https://t.co/P0J35tvupO
🎥 YouTube (10-min video): https://t.co/aSswKaShyC
How does Strefer lay the foundation for perceptually grounded, instruction-tuned Video LLMs? Dive into the researchers' walk-through below! 🧵

Michael Ryoo @ryoo_michael

about 1 year ago

What we end up having at CoRL 2025 will depend on the result.

Conference on Robot Learning @corl_conf

about 1 year ago

#CoRL2025 poll: If there is a K-Pop performance by a Korean idol group at the banquet, would you enjoy it?

Michael Ryoo @ryoo_michael

about 1 year ago

We show that the approach even allows "learning from human videos" to improve its performance. arxiv: https://t.co/EHDBROvp41 code: https://t.co/0WUkz8TB9C

Michael Ryoo @ryoo_michael

about 1 year ago

Introducing LangToMo, learning to use pixel motion forecasting as (universal) intermediate representations for robot control: https://t.co/1i8qlWyuxL

Michael Ryoo @ryoo_michael

about 1 year ago

We present a new System1-System2 model; it uses an image diffusion model as its high-level System2 to predict an embodiment-agnostic, pixel-based representation. A Transformer-based System1 maps such universal representations to actual robot actions.

ryoo_michael retweeted

Conference on Robot Learning @corl_conf

about 1 year ago

#CoRL2025 Hey Robot Learning Community! CoRL 2025 will be held in Seoul, Korea, Sep 27 - 30. Submission deadline: Apr 30 AoE. It's two weeks to go! Information: https://t.co/6AVF7UHg8g We are excited to receive your great work on robot learning!

Michael Ryoo @ryoo_michael

over 1 year ago

LLaRA will appear at #ICLR2025 !! It is an efficient transformation of a VLM into a robot VLA. For more details: https://t.co/lr46FPtcQF

Xiang Li @XiangLi54505720

over 1 year ago

(1/5) Excited to present our #ICLR2025 paper, LLaRA, at NYC CV Day! LLaRA efficiently transforms a pretrained Vision-Language Model (VLM) into a robot Vision-Language-Action (VLA) policy, even with a limited amount of training data. More details are in the thread. ⬇️

ryoo_michael retweeted

Salesforce AI Research @SFResearch

over 1 year ago

🚨🎥🚨🎥🚨 xGen-MM-Vid (BLIP-3-Video) is now available on @huggingface!
Our compact VLM achieves SOTA performance with just 32 tokens for video understanding. Features explicit temporal encoder + BLIP-3 architecture. Try it out!
🤗32 Token Model: https://t.co/S9mVhyXrMP
🤗128 Token Model: https://t.co/1juefgvHcg
📄Paper: https://t.co/910sKM7h19
🖥️Website: https://t.co/kvwcwKPUVC
🧵Research Refresher 👇
#ComputerVision #OpenAI #AIResearch #VLM
(1/3) Despite using much fewer tokens and being smaller (4B vs. 34B), xGen-MM-Vid provides comparable video question-answering accuracies to SOTA.

Michael Ryoo @ryoo_michael

over 1 year ago

CoRL 2025 will be co-located with Humanoids 2025 at the same venue!

Michael Ryoo @ryoo_michael

over 1 year ago

I am extremely pleased to announce that CoRL 2025 will be in Seoul, Korea! The organizing team includes myself and @gupta_abhinav_ as general chairs, and @JosephLim_AI, @songshuran, and Hae-Won Park (KAIST) as program chairs.

Michael Ryoo @ryoo_michael

over 1 year ago

BLIP-3-Video is out!

Salesforce AI Research @SFResearch

over 1 year ago

📢📢📢Introducing xGen-MM-Vid (BLIP-3-Video)! This highly efficient multimodal language model is laser-focused on video understanding. Compared to other models, xGen-MM-Vid represents a video with a fraction of the visual tokens (e.g., 32 vs. 4608 tokens). Paper: https://t.co/9333HUaQhE Website: https://t.co/kvwcwKQsLa Researcher’s 🧵:👇

ryoo_michael retweeted

AK @_akhaliq

almost 2 years ago

Salesforce presents xGen-MM (BLIP-3)
A Family of Open Large Multimodal Models
discuss: https://t.co/SruEf7WSUx
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

Michael Ryoo @ryoo_michael

almost 2 years ago

Introducing LLaRA !!! https://t.co/lr46FPsF17 It's a new robot action model, dataset, and framework based on LLMs/VLMs. It's open source and trainable at an academic scale (7B LLaVA-based), so you can finetune it for your robotics task!

Xiang Li @XiangLi54505720

almost 2 years ago

🚀 Excited to share our latest project: LLaRA - Supercharging Robot Learning Data for Vision-Language Policy! 🤖✨ We create a framework to turn robot expert trajectories into conversation-style data and other auxiliary data for instruction tuning. More details to come! (1/N)

ryoo_michael retweeted

Google DeepMind @GoogleDeepMind

almost 3 years ago

Today, we announced 𝗥𝗧-𝟮: a first of its kind vision-language-action model to control robots. 🤖 It learns from both web and robotics data and translates this knowledge into generalised instructions. Find out more: https://t.co/UWAzrhTOJG

ryoo_michael retweeted

Karol Hausman @hausman_k

almost 3 years ago

PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions? Introducing RT-2: https://t.co/MhgZqCRfOC our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions!

Michael Ryoo @ryoo_michael

almost 3 years ago

"Diffusion Illusions: Hiding Images in Plain Sight" received #CVPR2023 Outstanding Demo Award. https://t.co/JnWChmRb1w Congratulations @RyanBurgert @kahnchana @XiangLi54505720!

ryoo_michael retweeted

Ted Xiao @xiao_ted

almost 3 years ago

Looking forward to showcasing one of the first foundation models for robotics at #RSS2023 next week! Presenting "RT-1: Robotics Transformer for Real-world Control at Scale" from the Google DeepMind robotics team. Website: https://t.co/NtydvYFtMK Session: Tuesday 7/12, 3PM-5PM

ryoo_michael retweeted

Xiang Li @XiangLi54505720

almost 3 years ago

Introducing Crossway Diffusion, a diffusion-based visuomotor policy taking advantage of SSL. In short: we add state decoders to reconstruct states while training the diffusion policy, and it works better. More at: https://t.co/3SPn4Y0yxC
