(Thread 1/4) Announcing WALT — Web Agents that Learn Tools 🛠️
WALT reverse-engineers existing web automations (search, comment, filter) → reusable tools that allow agents to focus on higher-level reasoning rather than choreographing clicks.
This abstraction transforms the agent’s computational burden: instead of figuring out “how do I search for X, then filter by Y, then sort by Z” through complex UI sequences, the agent now simply calls search(X), filter(Y), sort(Z)!
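To make the contrast concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `FakeBrowser`, the CSS selectors, and the `search` / `filter_by` / `sort_by` helpers are hypothetical stand-ins, not WALT's actual API. It only shows how three tool calls can hide a longer sequence of low-level UI actions.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBrowser:
    """Hypothetical browser session that only records low-level actions."""
    log: list = field(default_factory=list)

    def click(self, selector: str) -> None:
        self.log.append(("click", selector))

    def type(self, selector: str, text: str) -> None:
        self.log.append(("type", selector, text))


# Each tool wraps a fixed UI sequence, so the agent never has to plan it.
# The selectors below are invented for this example.
def search(browser: FakeBrowser, query: str) -> None:
    browser.click("#search-box")
    browser.type("#search-box", query)
    browser.click("#search-submit")


def filter_by(browser: FakeBrowser, category: str) -> None:
    browser.click("#filter-menu")
    browser.click(f"#filter-{category}")


def sort_by(browser: FakeBrowser, key: str) -> None:
    browser.click("#sort-menu")
    browser.click(f"#sort-{key}")


browser = FakeBrowser()
search(browser, "running shoes")
filter_by(browser, "mens")
sort_by(browser, "price")

agent_calls = 3                # what the agent has to reason about
ui_actions = len(browser.log)  # what actually happens in the page
print(f"{agent_calls} tool calls covered {ui_actions} UI actions")
```

In this toy setup the agent plans three calls while the tools carry out the clicks and keystrokes underneath.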
🚀 Introducing BLIP3o-NEXT from @SFResearch -- a fully open-source foundation model that unifies text-to-image generation and image editing within a single architecture. Key insights:
1️⃣ Architecture-wise: most design choices show comparable performance — what matters is scalability, inference speed and simplicity.
2️⃣ Reinforcement Learning (RL): applying RL to image generation (especially the autoregressive part) can push the frontier further.
3️⃣ Image editing remains challenging: improving consistency between generated and reference images is still an open problem — we improve it via post-training, data-engineering and VAE feature conditioning.
4️⃣ Data quality & scale: these remain a decisive determinant of the performance upper bound.
🔧 On the technical side: we adopt an Autoregressive + Diffusion hybrid — the AR model generates discrete image tokens conditioned on multimodal input, then those hidden states condition a diffusion model to produce fine-grained, high-fidelity images.
🧪 In image editing, we integrate VAE latents (from the reference image) in two ways: (a) cross-attention into the diffusion model, (b) noise-space injection. Combining both gives best consistency.
📂 And yes: the model, code, datasets, and evaluation pipelines are all fully open-sourced, so the community can reproduce & build on this.
📄 Paper: https://t.co/2aq0bU6RQv
🤗 Resources: https://t.co/dtRnFYkuI5
💻 Code: https://t.co/MpVyKKb7hL
#BLIP3oNEXT #OpenSourceAI
Browser agents — and agents in general — should learn to discover and use higher-level skills rather than executing low-level atomic actions.
WALT turns unsupervised web interactions into structured, reusable skills, enabling agents to act with fewer steps and greater reliability than low-level click-based control.
Humans don’t just use tools — we invent them.
That’s the next frontier for AI agents.
At @SFResearch, we’re introducing WALT (Web Agents that Learn Tools) — a framework that teaches browser agents to discover and reverse-engineer a website’s hidden functionality into reusable tools.
Through a demonstrate → generate → validate loop, WALT systematically transforms web interactions into structured APIs — moving us closer to truly autonomous web intelligence.
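The demonstrate → generate → validate loop can be sketched roughly as follows. This is a toy under stated assumptions, not WALT's implementation: the `demonstrate`, `generate`, `validate`, and `discover_tool` functions and the string-based "traces" are all hypothetical, and the first generation attempt is made deliberately faulty so that validation has something to reject.

```python
from typing import Callable, Optional

Trace = list[str]
Tool = Callable[[str], str]


def demonstrate(feature: str) -> Trace:
    """Hypothetical: record the UI steps observed while using a site feature."""
    return [f"open {feature} form", "type {arg}", "press submit"]


def generate(trace: Trace, attempt: int) -> Tool:
    """Hypothetical: turn a recorded trace into a parameterized tool.

    The first attempt drops the final step on purpose, to mimic a faulty draft.
    """
    steps = trace if attempt > 0 else trace[:-1]

    def tool(arg: str) -> str:
        return " -> ".join(step.replace("{arg}", arg) for step in steps)

    return tool


def validate(tool: Tool) -> bool:
    """Hypothetical check: run the tool on a test input and inspect the result."""
    return tool("test input").endswith("press submit")


def discover_tool(feature: str, max_attempts: int = 3) -> Optional[Tool]:
    trace = demonstrate(feature)
    for attempt in range(max_attempts):
        candidate = generate(trace, attempt)
        if validate(candidate):
            return candidate  # only validated tools are kept
    return None


search_tool = discover_tool("search")
print(search_tool("cats"))
```

The point of the design is that a generated tool is only exposed to the agent once it has passed a check, so a bad first draft costs a retry rather than a failed task.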
We benchmark WALT on VisualWebArena and WebArena — discovering 50+ reusable tools across search, content management, and communication.
WALT reaches SOTA success rates of 52.9% on VisualWebArena and 50.1% on WebArena, with 10–30% higher accuracy and 1.3–1.4× fewer steps.
Paper: https://t.co/Hm6ORanVWn
Code: https://t.co/akK25VuyDf
@virprabh @yutong_dai @jinggu4ai @luo_yanqi @silviocinguetta @LiJunnan0409 @ZeyuanChen @stanleyran
(3/4) Outcome: up to 30% higher success rates with 1.4x fewer steps / LLM-calls (new SoTA on VisualWebArena) 📈
Here’s another example of finding stay options on Airbnb: Baseline web agent (left), WALT agent (right).
(4/4) We provide a simple CLI for discovery/serving (MCP) with WALT – try it out with
🚀walt discover <your-url>; walt agent <your-task> --start-url <your-url>
📝 Paper: https://t.co/W5VJlitRlr
🔗 Code: https://t.co/uFZKuNdEBh
Authors: @virprabh, @yutong_dai, Matthew Fernandez, @jinggu4ai, Krithika Ramakrishnan, @luo_yanqi, @silviocinguetta, @CaimingXiong, @LiJunnan0409, @ZeyuanChen, and @stanleyran.
#EnterpriseAI #FutureOfAI #WebAgents #LLM #Automation
Thank you to the award committee and the broader vision community for the recognition. After all these (21!) years and so many conferences across sub-disciplines in AI, the vision community continues to feel like home.
What makes this extra special is that the original VQA paper, where we first introduced the VQA task and v1 of the dataset, was published at ICCV, exactly 10 years ago!
“We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer….” It is quite simply ridiculous how far the field has come since!
Congratulations to all the VQA authors, and the VQA challenge + workshop organizers over the years! GG :)
#ICCV2025
Thank you so much Caiming!
We show that adding coding as a new action type alongside GUI actions significantly improves the performance of computer-using agents (CUAs) while reducing the total number of actions needed to solve a task.
If you are interested in it, please take a look at our newly released paper: https://t.co/ei57preo8g
🚀 Computer-using agents represent a powerful new paradigm for human-computer interaction. Over the past year, we’ve explored multiple approaches to tackle the key challenges in building robust CUA systems.
12/2024 we released Aguvis (https://t.co/PjO1FQn4Ck)
07/2025 we released GTA1 (https://t.co/wkCjfmXWC7)
Today, we introduce CoAct-1, a hybrid agent that elevates coding to a first-class action alongside GUI manipulation. On OSWorld, CoAct-1 achieves a new SOTA score of 60.76%, becoming the first CUA to cross the 60-point mark.
Takeaways
- Treat code as an action, not just a tool call.
- Hybrid action space (code + GUI) reduces error accumulation and boosts reliability.
- New SOTA on OSWorld with better efficiency and broader applicability.
Paper: https://t.co/Pk7isDcsnd
Page: https://t.co/xwQl1KOEYJ
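To illustrate the "code as an action" takeaway, here is a small Python sketch of a hybrid action space. It is an assumption-laden toy, not CoAct-1's architecture: `GuiAction`, `CodeAction`, and `execute` are hypothetical names, and the bare `exec` call stands in for what would need to be a sandboxed interpreter in any real system.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class GuiAction:
    kind: str    # e.g. "click" or "type"
    target: str  # hypothetical UI element identifier


@dataclass
class CodeAction:
    source: str  # Python source to run against the task state


Action = Union[GuiAction, CodeAction]


def execute(action: Action, state: dict) -> None:
    """Dispatch one action; code and GUI actions share the same plan."""
    if isinstance(action, CodeAction):
        # Illustration only: a real agent would run this in a sandbox.
        exec(action.source, {}, state)
    else:
        state.setdefault("gui_log", []).append((action.kind, action.target))


# GUI-only plan: visit every cell by hand (one action per cell).
gui_plan = [GuiAction("click", f"cell-{i}") for i in range(5)]

# Hybrid plan: compute the total in code, then one GUI step to place it.
hybrid_plan = [CodeAction("total = sum(values)"), GuiAction("type", "summary-cell")]

state = {"values": [1, 2, 3, 4, 5]}
for action in hybrid_plan:
    execute(action, state)

print(state["total"], len(hybrid_plan), "vs", len(gui_plan))
```

Fewer actions per task also means fewer places for an error to compound, which is the intuition behind the reliability claim in the takeaways above.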
🚨🚨 Paper submission deadline extended to May 4.
Submit your work (in-progress or complete!) to the EMACS workshop @CVPR2025 in Nashville!
Submission link: https://t.co/05Nr8zQNJx
#CVPR2025 #GenerativeAI #bias
🚀 Excited about how generative AI can power experimental (not just observational) audits of ML systems that reveal actionable insights into performance and bias?
Join us at EMACS (Experimental Model Auditing with Controllable Synthesis) workshop @CVPR!
https://t.co/JwQayb5wNu
🚀 Excited about how generative AI can power experimental (not just observational) audits of ML systems that reveal actionable insights into performance and bias?
Join us at the first-ever EMACS workshop @CVPR2025 in Nashville!
🌟 Speakers & submissions: https://t.co/nskBOrnkyE
Introducing Gaze-LLE, a new model for gaze target estimation built on top of a frozen visual foundation model!
Gaze-LLE achieves SOTA results on multiple benchmarks while learning minimal parameters, and shows strong generalization.
paper: https://t.co/Is2NgrrurO
Looking forward to some Miami sun this week at #EMNLP2024, my first NLP conference in ~7 years! ☀️
HMU if you’d like to learn more about our work at @SFResearch or just meet/catch up! 🍹
🤔Ever wondered why merging LoRA models is trickier than fully-finetuned ones?
🔍We explore this and discover that poor alignment b/w LoRA models leads to subpar merging.
💡The solution? KnOTS🪢— our latest work that uses SVD to improve alignment and boosts SOTA merging methods.
Introducing EgoMimic - just wear a pair of Project Aria @meta_aria smart glasses 👓 to scale up your imitation learning datasets!
Check out what our robot can do.
A thread below👇
And for those of you who prefer consuming papers as podcasts (!), here's NotebookLM doing a better job of explaining mine than I ever could: https://t.co/OtGQStvqRo
🚨🚨🚨Introducing PROVE: A new programmatic benchmark for evaluating vision-language models (VLMs).
VLMs often provide responses that are unhelpful, contain false claims about the image, or both. However, benchmarking this in the wild can be surprisingly hard! Enter PROVE, which:
💥 Includes challenging visual QA pairs that are *grounded by design*
💥 Provides a programmatic evaluation framework to quantify response *helpfulness* and *truthfulness*
🕹️ Explore: https://t.co/rHbey37C61
🤗 Data: https://t.co/E5bOC79aQo
📎 Paper: https://t.co/JCEmAuW7pF
🧵 Details in comments 👇
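As a rough illustration of what programmatically scoring *helpfulness* and *truthfulness* could look like, here is a Python sketch. It rests on assumptions that do not come from the post: claims are represented as hypothetical (subject, relation, object) triples, matching is exact, helpfulness is treated as recall against a reference answer, and truthfulness as precision against facts about the image. PROVE's actual definitions may differ.

```python
Claim = tuple[str, str, str]  # hypothetical (subject, relation, object) triple


def helpfulness(response: set[Claim], reference: set[Claim]) -> float:
    """Fraction of the reference answer's claims that the response covers."""
    if not reference:
        return 1.0
    return len(response & reference) / len(reference)


def truthfulness(response: set[Claim], image_facts: set[Claim]) -> float:
    """Fraction of the response's claims that are supported by the image."""
    if not response:
        return 1.0
    return len(response & image_facts) / len(response)


# Toy example: the response answers the question but adds one false claim.
image_facts = {("dog", "is", "brown"), ("dog", "on", "sofa"), ("sofa", "is", "red")}
reference = {("dog", "on", "sofa")}
response = {("dog", "on", "sofa"), ("dog", "is", "black")}

h = helpfulness(response, reference)
t = truthfulness(response, image_facts)
print(f"helpfulness={h:.2f} truthfulness={t:.2f}")
```

Scoring the two axes separately is what lets a response be fully helpful yet only partly truthful, as in this toy case.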