Viraj Prabhu

8 months ago

Check out our latest work on building Web Agents that Learn Tools (WALT) to get more done faster! 🧵👇🏻

Associate Professor @UCIrvine, formerly @GeorgiaTech. Researcher in Computer Vision, Machine Learning, AI. PhD from @berkeley_ai. Views my own.

8 months ago

(Thread 1/4) Announcing WALT — Web Agents that Learn Tools 🛠️ WALT reverse-engineers existing web automations (search, comment, filter) → reusable tools that allow agents to focus on higher-level reasoning rather than choreographing clicks. This abstraction transforms the agent’s computational burden: instead of figuring out “how do I search for X, then filter by Y, then sort by Z” through complex UI sequences, the agent now simply calls search(X), filter(Y), sort(Z)!
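To make the contrast in the tweet above concrete, here is a minimal Python sketch of the idea. It is not WALT's code: `FakePage`, `search_steps`, and `ToolRegistry` are hypothetical names invented for illustration, and the "page" is an in-memory stand-in rather than a real browser. It only shows how a fixed click sequence can be hidden behind one `search(X)`-style call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page; records low-level UI actions."""
    items: List[str]
    log: List[str] = field(default_factory=list)
    query: str = ""

    def click(self, selector: str) -> None:
        self.log.append(f"click {selector}")

    def type_text(self, selector: str, text: str) -> None:
        self.log.append(f"type {selector} {text!r}")
        self.query = text

    def results(self) -> List[str]:
        # Case-insensitive substring match over the fake catalogue.
        return [i for i in self.items if self.query.lower() in i.lower()]


def search_steps(page: FakePage, query: str) -> List[str]:
    """The click choreography an agent would otherwise plan step by step."""
    page.click("#search-box")
    page.type_text("#search-box", query)
    page.click("#search-button")
    return page.results()


@dataclass
class ToolRegistry:
    """Maps tool names to callables so the agent issues one call per intent."""
    page: FakePage
    tools: Dict[str, Callable[..., List[str]]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., List[str]]) -> None:
        self.tools[name] = fn

    def call(self, name: str, *args: str) -> List[str]:
        return self.tools[name](self.page, *args)


page = FakePage(items=["Red bike", "Blue bike", "Red helmet"])
registry = ToolRegistry(page)
registry.register("search", search_steps)

hits = registry.call("search", "red")
print(hits)           # ['Red bike', 'Red helmet']
print(len(page.log))  # 3 low-level actions hidden behind one tool call
```

In this toy setup the agent reasons about one `search` call while the three UI actions stay inside the tool.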

virprabh retweeted

Caiming Xiong

@CaimingXiong

7 months ago

🚀 Introducing BLIP3o-NEXT from @SFResearch -- a fully open-source foundation model that unifies text-to-image generation and image editing within a single architecture. Key insights:
1️⃣ Architecture-wise: most design choices show comparable performance; what matters is scalability, inference speed and simplicity.
2️⃣ Reinforcement Learning (RL): applying RL to image generation (especially the autoregressive part) can push the frontier further.
3️⃣ Image editing remains challenging: improving consistency between generated and reference images is still an open problem; we improve it via post-training, data engineering and VAE feature conditioning.
4️⃣ Data quality & scale: remains a decisive determinant of the performance upper bound.

🔧 On the technical side: we adopt an Autoregressive + Diffusion hybrid. The AR model generates discrete image tokens conditioned on multimodal input, then those hidden states condition a diffusion model to produce fine-grained, high-fidelity images.
🧪 In image editing, we integrate VAE latents (from the reference image) in two ways: (a) cross-attention into the diffusion model, (b) noise-space injection. Combining both gives the best consistency.

📂 And yes: model, code, datasets, evaluation pipelines are fully released as open source, meaning the community can reproduce & build on this.

📄 Paper: https://t.co/2aq0bU6RQv
🤗 Resources: https://t.co/dtRnFYkuI5
💻 Code: https://t.co/MpVyKKb7hL
#BLIP3oNEXT #OpenSourceAI

virprabh retweeted

Li Junnan

@LiJunnan0409

8 months ago

Browser agents — and agents in general — should learn to discover and use higher-level skills rather than executing low-level atomic actions. WALT turns unsupervised web interactions into structured, reusable skills, enabling agents to act with fewer steps and greater reliability than low-level click-based control.

virprabh retweeted

Caiming Xiong

@CaimingXiong

8 months ago

Humans don’t just use tools; we invent them. That’s the next frontier for AI agents.

At @SFResearch, we’re introducing WALT (Web Agents that Learn Tools), a framework that teaches browser agents to discover and reverse-engineer a website’s hidden functionality into reusable tools.

Through a demonstrate → generate → validate loop, WALT systematically transforms web interactions into structured APIs, moving us closer to truly autonomous web intelligence.

We benchmark WALT on VisualWebArena and WebArena, discovering 50+ reusable tools across search, content management, and communication.
WALT hits 52.9% / 50.1% SOTA success, with 10–30% higher accuracy and 1.3–1.4× fewer steps.

Paper: https://t.co/Hm6ORanVWn
Code: https://t.co/akK25VuyDf

@virprabh @yutong_dai @jinggu4ai @luo_yanqi @silviocinguetta @LiJunnan0409 @ZeyuanChen @stanleyran
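The demonstrate → generate → validate loop mentioned above can be pictured with a toy Python sketch. This is an illustration under loud assumptions, not the framework's implementation: the recorded trace, the selectors, and the functions `demonstrate`, `generate`, `validate`, and `discover` are all hypothetical, and "validation" here is just a replay check on a fresh input.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (verb, argument), e.g. ("type", "laptop")
Tool = Callable[[str], List[Action]]


def demonstrate(example_query: str) -> List[Action]:
    """Stand-in for recording a real browser session (hypothetical trace)."""
    return [("click", "#search-box"), ("type", example_query), ("click", "#go")]


def generate(trace: List[Action], literal: str) -> Tool:
    """Parameterize a recorded trace: every occurrence of `literal` becomes the tool argument."""
    def tool(value: str) -> List[Action]:
        return [(verb, value if arg == literal else arg) for verb, arg in trace]
    return tool


def validate(tool: Tool, trace: List[Action], test_input: str) -> bool:
    """Replay the tool on a fresh input and check the input is actually used."""
    replay = tool(test_input)
    return len(replay) == len(trace) and ("type", test_input) in replay


def discover(name: str, example: str) -> Dict[str, Tool]:
    """Run one pass of the loop; keep the tool only if validation succeeds."""
    trace = demonstrate(example)
    tool = generate(trace, example)
    return {name: tool} if validate(tool, trace, "headphones") else {}


tools = discover("search", "laptop")
print(sorted(tools))               # ['search']
print(tools["search"]("monitor"))  # [('click', '#search-box'), ('type', 'monitor'), ('click', '#go')]
```

A real system would replay against the live site and check the resulting page state; the sketch keeps only the control flow of the three stages.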

virprabh retweeted

8 months ago

(3/4) Outcome: up to 30% higher success rates with 1.4x fewer steps / LLM-calls (new SoTA on VisualWebArena) 📈 Here’s another example of finding stay options on Airbnb: Baseline web agent (left), WALT agent (right).

virprabh retweeted

8 months ago

(4/4) We provide a simple CLI for discovery/serving (MCP) with WALT. Try it out with 🚀
walt discover <your-url>
walt agent <your-task> --start-url <your-url>

📝 Paper: https://t.co/W5VJlitRlr
🔗 Code: https://t.co/uFZKuNdEBh

Authors: @virprabh, @yutong_dai, Matthew Fernandez, @jinggu4ai, Krithika Ramakrishnan, @luo_yanqi, @silviocinguetta, @CaimingXiong, @LiJunnan0409, @ZeyuanChen, and @stanleyran.

#EnterpriseAI #FutureOfAI #WebAgents #LLM #Automation

virprabh retweeted

Devi Parikh

@deviparikh

8 months ago

Thank you to the award committee and the broader vision community for the recognition. After all these (21!) years and so many conferences across sub-disciplines in AI, the vision community continues to feel like home.

What makes this extra special is that the original VQA paper, where we first introduced the VQA task and v1 of the dataset, was published at ICCV, exactly 10 years ago!

“We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer….” It is quite simply ridiculous how far the field has come since!

Congratulations to all the VQA authors, and the VQA challenge + workshop organizers over the years! GG :)

#ICCV2025

8 months ago

I'll be presenting this at the first poster session tomorrow (Oct 21, 11.45am, Exhibit Hall I #301) – stop by if you're attending #ICCV2025! 🏖️

over 1 year ago

💥 Super excited to introduce our latest work on **programmatically** benchmarking vision-language models in the wild 👇

virprabh retweeted

Linxin Song

@linxins2

10 months ago

Thank you so much Caiming! We show that involving coding as a new type of action apart from GUI action for CUA can significantly help improve the computer-using performance while reducing the total actions for task solving. If you are interested in it, please take a look at our newly released paper: https://t.co/ei57preo8g

virprabh retweeted

Caiming Xiong

@CaimingXiong

10 months ago

🚀 Computer-using agents represent a powerful new paradigm for human-computer interaction. Over the past year, we’ve explored multiple approaches to tackle the key challenges in building robust CUA systems.

12/2024 we released Aguvis (https://t.co/PjO1FQn4Ck)
07/2025 we released GTA1 (https://t.co/wkCjfmXWC7)

Today, we introduce CoAct-1, a hybrid agent that elevates coding to a first-class action alongside GUI manipulation. On OSWorld, CoAct-1 achieves a new SOTA score of 60.76%, becoming the first CUA to cross the 60-point mark.

Takeaways
- Treat code as an action, not just a tool call.
- Hybrid action space (code + GUI) reduces error accumulation and boosts reliability.
- New SOTA on OSWorld with better efficiency and broader applicability.

Paper: https://t.co/Pk7isDcsnd
Page: https://t.co/xwQl1KOEYJ

about 1 year ago

Happening now in 208B, come check out the first EMACS workshop! #CVPR2025

Experimental Model Auditing Workshop @CVPR2025 @emacscvpr25

over 1 year ago

Join us at the first-ever EMACS workshop @CVPR! 🚨 Submissions open March 5: https://t.co/R6armqCk2R See you in Nashville! 🎸 #CVPR2025

virprabh retweeted

about 1 year ago

🚨🚨 Paper submission deadline extended to May 4. Submit your work (in-progress or complete!) to the EMACS workshop @CVPR2025 in Nashville! Submission link: https://t.co/05Nr8zQNJx #CVPR2025 #GenerativeAI #bias

virprabh retweeted

Judy Hoffman @judyfhoffman

about 1 year ago

🚀 Excited about how generative AI can power experimental (not just observational) audits of ML systems that reveal actionable insights into performance and bias? Join us at EMACS (Experimental Model Auditing with Controllable Synthesis) workshop @CVPR! https://t.co/JwQayb5wNu

Experimental Model Auditing Workshop @CVPR2025 @emacscvpr25

over 1 year ago

Join us at the first-ever EMACS workshop @CVPR! 🚨 Submissions open March 5: https://t.co/R6armqCk2R See you in Nashville! 🎸 #CVPR2025

over 1 year ago

🚀 Excited about how generative AI can power experimental (not just observational) audits of ML systems that reveal actionable insights into performance and bias? Join us at the first-ever EMACS workshop @CVPR2025 in Nashville! 🌟 Speakers & submissions: https://t.co/nskBOrnkyE

virprabh retweeted

Fiona Ryan @fionakryan

over 1 year ago

Introducing Gaze-LLE, a new model for gaze target estimation built on top of a frozen visual foundation model! Gaze-LLE achieves SOTA results on multiple benchmarks while learning minimal parameters, and shows strong generalization. Paper: https://t.co/Is2NgrrurO

Pratik Ramesh @pratikramesh7

over 1 year ago

Looking forward to some Miami sun this week at #EMNLP2024, my first NLP conference in ~7 years! ☀️ HMU if you’d like to learn more about our work at @SFResearch or just meet/catch up! 🍹

virprabh retweeted

over 1 year ago

🤔 Ever wondered why merging LoRA models is trickier than merging fully-finetuned ones? 🔍 We explore this and discover that poor alignment between LoRA models leads to subpar merging. 💡 The solution? KnOTS 🪢: our latest work, which uses SVD to improve alignment and boosts SOTA merging methods.

virprabh retweeted

Simar Kareer @simar_kareer

over 1 year ago

Introducing EgoMimic - just wear a pair of Project Aria @meta_aria smart glasses 👓 to scale up your imitation learning datasets! Check out what our robot can do. A thread below👇

over 1 year ago

And for those of you who prefer consuming papers as podcasts (!), here's NotebookLM doing a better job of explaining mine than I ever could: https://t.co/OtGQStvqRo

over 1 year ago

💥 Super excited to introduce our latest work on **programmatically** benchmarking vision-language models in the wild 👇