Ran Xu

@stanleyran

Research Director @ Salesforce AI Research

Joined August 2010

225 Following

63 Followers

72 Posts

stanleyran retweeted

Salesforce AI Research

@SFResearch

7 months ago

🎨 Introducing BLIP3o-NEXT: The Next Frontier of Native Image Generation
📄 Paper: https://t.co/zpCKImChM1

A fully open-science foundation model that unifies text-to-image generation AND image editing in one architecture. 🖼️✨

Key innovations:
➡️ Autoregressive + Diffusion design combining reasoning + fine-detail rendering 🧠
➡️ First successful RL application for native image generation (GRPO on discrete tokens) 🎯
➡️ State-of-the-art performance on GenEval & image editing benchmarks 📊

#FutureOfAI #EnterpriseAI #ImageGeneration #DiffusionModels #ReinforcementLearning #ImageEditing #ComputerVision #MachineLearning #AI

stanleyran retweeted

Caiming Xiong

@CaimingXiong

8 months ago

Humans don’t just use tools — we invent them. That’s the next frontier for AI agents. At @SFResearch, we’re introducing WALT (Web Agents that Learn Tools) — a framework that teaches browser agents to discover and reverse-engineer a website’s hidden functionality into reusable tools. Through a demonstrate → generate → validate loop, WALT systematically transforms web interactions into structured APIs — moving us closer to truly autonomous web intelligence. We benchmark WALT on VisualWebArena and WebArena — discovering 50+ reusable tools across search, content management, and communication. WALT hits 52.9% / 50.1% SOTA success, with 10–30% higher accuracy and 1.3–1.4× fewer steps. Paper: https://t.co/Hm6ORanVWn Code: https://t.co/akK25VuyDf @virprabh @yutong_dai @jinggu4ai @luo_yanqi @silviocinguetta @LiJunnan0409 @ZeyuanChen @stanleyran

stanleyran retweeted

Caiming Xiong

@CaimingXiong

10 months ago

🚀 Computer-using agents represent a powerful new paradigm for human-computer interaction. Over the past year, we’ve explored multiple approaches to tackle the key challenges in building robust CUA systems. 12/2024 we released Aguvis (https://t.co/PjO1FQn4Ck) 07/2024 we released GTA1 (https://t.co/wkCjfmXWC7) Today, we introduce CoAct-1 — a hybrid agent that elevates coding to a first-class action alongside GUI manipulation. On OSWorld, CoAct-1 achieves a new SOTA score of 60.76%, becoming the first CUA agent to cross the 60-point mark. Takeaways - Treat code as an action, not just a tool call. - Hybrid action space (code + GUI) reduces error accumulation and boosts reliability. - New SOTA on OSWorld with better efficiency and broader applicability. Paper: https://t.co/Pk7isDcsnd Page: https://t.co/xwQl1KOEYJ

stanleyran retweeted

Salesforce AI Research

@SFResearch

12 months ago

From Flow Generalists to Champions: Building #AgenticAI for Salesforce Automation💻 Introducing Enterprise General Intelligence (#EGI) models for Salesforce Flow automation! Blog: https://t.co/RXELbt1Gzq Unlike frontier LLMs that treat this as token generation, our EGI approach: ✅ Encodes enterprise domain knowledge in a custom DSL ✅ Trains in Flow Simulator with continuous self-improvement ✅ Achieves 50% relative improvement with 88% less data EGI isn't just better AI—it's AI purpose-built for enterprise. 32% → 48% activation rate on complex flows proves it works. #EnterpriseAI #FutureOfAI

stanleyran retweeted

Li Junnan

@LiJunnan0409

about 1 year ago

🚀 We’re open-sourcing Grounding-R1, a series of SoTA models for GUI Grounding, trained with RL using a simple click-based reward.
🧠 Dive into our blog post: “GRPO for GUI Grounding Done Right” for the full training recipe.
https://t.co/MAznrgsbAf https://t.co/8CR6QbU34w

stanleyran retweeted

Salesforce AI Research

@SFResearch

about 1 year ago

🚨NEW MODEL: BLIP3-o 🚨

🔬 Researchers from @SFResearch + @ml_umd introduce BLIP3-o: solving AI's dual challenge of building ONE model that both understands AND generates images at SOTA level.

💡 Key innovation: dual-stage training with frozen autoregressive backbone prevents task interference - the model excels at both understanding and generation simultaneously.

🔓 Open source for the research community: https://t.co/dNOVoVT7R7
🤗 Model: https://t.co/mdldqMN1n2
💻 Demo: https://t.co/Tni7UVZD4X
📎 Blog: https://t.co/p5hhkcFXJg
🗞️ Feature: https://t.co/EMOtgsmgly

#FutureOfAI #EnterpriseAI #OpenScience @github @Marktechpost

stanleyran retweeted

Salesforce AI Research

@SFResearch

about 1 year ago

We're thrilled to announce BLIP3-o, a breakthrough in unified multimodal models that excels at both image understanding and generation in a single autoregressive architecture! 💫

📊 Paper: https://t.co/M6oqNcyQqc
🤗 Models: https://t.co/pNwvTDbvsq
🧠 Code: https://t.co/lLmWFZxYjr
📽️ Learn on the go (AI Generated): https://t.co/2Cu4f8aRSI

Our research reveals that using CLIP features with diffusion transformer and flow matching creates superior performance while reducing computational complexity.
Most importantly, we're making this model family available to the AI Research community:

▶️ Complete model implementations
▶️ Model weights
▶️ 25M+ detailed caption pretrain dataset
▶️ 60K high-quality instruction tuning dataset

Advance your multimodal AI research and share your findings in the comments. (And thanks for the shout, @_akhaliq!)

stanleyran retweeted

@_akhaliq

about 1 year ago

Salesforce just dropped BLIP3-o on Hugging Face

A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset https://t.co/AKuVc6N3gw

stanleyran retweeted

Salesforce AI Research

@SFResearch

over 1 year ago

🚨🎥🚨🎥🚨 xGen-MM-Vid (BLIP-3-Video) is now available on @huggingface!

Our compact VLM achieves SOTA performance with just 32 tokens for video understanding. Features explicit temporal encoder + BLIP-3 architecture. Try it out!

🤗32 Token Model: https://t.co/S9mVhyXrMP
🤗128 Token Model: https://t.co/1juefgvHcg
📄Paper: https://t.co/910sKM7h19
🖥️Website: https://t.co/kvwcwKPUVC
🧵Research Refresher 👇

#ComputerVision #OpenAI #AIResearch #VLM

(1/3)
Despite using far fewer tokens and being smaller (4B vs. 34B), xGen-MM-Vid provides video question-answering accuracy comparable to SOTA.

stanleyran retweeted

Salesforce AI Research

@SFResearch

over 1 year ago

🔬🔬🔬Introducing ProVision: A new system for transforming images into verified instruction data for multimodal language models (MLMs) at massive scale!
Scene graphs + programmatic synthesis generate 10M+ diverse, automated Q&A pairs. Fully verifiable.

Training MLMs? Dive in:
📰Blog: https://t.co/ILrIENeThS
🗞️Paper: https://t.co/gR2FHx2WrB
💻Dataset: https://t.co/nt2BlmdE40
👇Researcher’s 🧵👇

(1/6) Why build ProVision?
Training multimodal LMs demands massive instruction datasets - pairing images with Q&As. Manual creation is costly, while using existing models risks hallucinations.

ProVision's novel solution? Scene graphs + human-written programs. We represent images as structured graphs capturing objects, attributes & relationships. Then, using Python programs and textual templates, our data generators synthesize instruction data by creating questions and answers from the scene graph.

👇🧵 for more...
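
The scene-graph idea in the thread above can be illustrated with a toy sketch. This is not ProVision's code: the `SceneObject` and `SceneGraph` classes, the two templates, and the example graph are all hypothetical, chosen only to show how a program plus a text template can turn a structured graph into question-answer pairs whose answers are read directly from the graph.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    """One node of a toy scene graph (hypothetical schema)."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Objects plus (subject index, predicate, object index) triples."""
    objects: List[SceneObject]
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


def attribute_questions(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Template 1: ask about the attributes of every object that has any."""
    pairs = []
    for obj in graph.objects:
        if obj.attributes:
            question = f"What are the attributes of the {obj.name}?"
            pairs.append((question, ", ".join(obj.attributes)))
    return pairs


def relation_questions(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Template 2: ask for the object of each relation triple."""
    pairs = []
    for subj_idx, predicate, obj_idx in graph.relations:
        subject = graph.objects[subj_idx].name
        target = graph.objects[obj_idx].name
        pairs.append((f"What is the {subject} {predicate}?", target))
    return pairs


def generate_qa(graph: SceneGraph) -> List[Tuple[str, str]]:
    """Run every generator; answers come straight from the graph."""
    return attribute_questions(graph) + relation_questions(graph)


if __name__ == "__main__":
    # Made-up example graph: a black cat sitting on a red mat.
    example = SceneGraph(
        objects=[SceneObject("cat", ["black"]), SceneObject("mat", ["red"])],
        relations=[(0, "sitting on", 1)],
    )
    for question, answer in generate_qa(example):
        print(f"Q: {question}  A: {answer}")
```

Because each answer is copied from the graph rather than produced by a model, every pair can be checked against its source, which is the sense of "verifiable" in this sketch.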

stanleyran retweeted

Salesforce AI Research

@SFResearch

over 1 year ago

🚨🚨🚨Introducing PROVE: A new programmatic benchmark for evaluating vision-language models (VLMs).

VLMs often provide responses that are unhelpful, contain false claims about the image, or both. However, benchmarking this in the wild can be surprisingly hard!

Enter PROVE, which:
💥 Includes challenging visual QA pairs that are *grounded by design*
💥 Provides a programmatic evaluation framework to quantify response *helpfulness* and *truthfulness*

🕹️ Explore: https://t.co/rHbey37C61
🤗 Data: https://t.co/E5bOC79aQo
📎 Paper: https://t.co/JCEmAuW7pF
🧵 Details in comments 👇

stanleyran retweeted

Salesforce AI Research

@SFResearch

over 1 year ago

📢📢📢Introducing xGen-MM-Vid (BLIP-3-Video)! This highly efficient multimodal language model is laser-focused on video understanding. Compared to other models, xGen-MM-Vid represents a video with a fraction of the visual tokens (e.g., 32 vs. 4608 tokens). Paper: https://t.co/9333HUaQhE Website: https://t.co/kvwcwKQsLa Researcher’s 🧵:👇

stanleyran retweeted

Juan Carlos Niebles @CVPR

@jcniebles

over 1 year ago

🏃🏻Swing by the ongoing poster session in Amber 5 before it’s over! Our team is here to chat about xGen-MM (BLIP3). #ECCV2024 @eccvconf

stanleyran retweeted

Salesforce AI Research

@SFResearch

over 1 year ago

🇮🇹🚀💥Headed to #ECCV2024? Bookmark this for a deep dive into our team’s groundbreaking research across multiple domains.👇

SUNDAY 29th SEPT (All times CEST)

xGen-VideoSyn-1: Setting new standards in text-to-video synthesis
10:38 - 11:20am
Room: Amber 7 + 8
📝 AI4VA Workshop: https://t.co/qMJxEIAHP8
📚Paper: https://t.co/FvEWugF6au

MONDAY 30 SEPT

ECCV2024 Workshop on Multimodal Agents
8:30am - 12:30pm
Room: Amber 7 + 8
📝Workshop: https://t.co/DWxhzhsDlJ

BootPIG: Bootstrapping zero-shot personalized image generation
17:40 - 17:55 (5:40pm - 5:55pm)
Room: Space 2
📝Synthetic Data4CV Workshop: https://t.co/yEsceZjLYi
📚Poster / Paper: https://t.co/OCKylEOWyi

xGen-MM (BLIP-3): A groundbreaking family of multimodal models
16:00 - 20:00 (4-8pm)
Room: Amber 5
📝EVAL-FoMo 24 Workshop: https://t.co/CNeSQR60u6
📚Poster / Paper: https://t.co/k4xqRP8avN

WEDNESDAY 2 OCT

LayoutDETR: Redefining multimodal layout design
10:30am - 12:30pm
📚Poster / Paper: https://t.co/rgfNwnC9qk

X-InstructBLIP: Pioneering cross-modal reasoning
16:30 - 18:30 (4:30 - 6:30pm)
📚Poster / Paper: https://t.co/WzlE664mZy

FRIDAY 4 OCT

SQ-LLaVA: Self-questioning in vision-language AI
10:30am - 12:30pm
📚Poster / Paper: https://t.co/JLHquPrczu

See you in Milan, @eccvconf 🤖 #AIResearch #ComputerVision

stanleyran retweeted

Silvio Savarese

@silviocinguetta

almost 2 years ago

Happy to see our team's hard work come to fruition. The xLAM family of models represents a huge leap in AI capabilities for function calling, planning and reasoning—fit-for-purpose for varied needs of modern business. Eager to see where its application takes us! #AIInnovation

stanleyran retweeted

@_akhaliq

almost 2 years ago

Salesforce presents xGen-VideoSyn-1

High-fidelity Text-to-Video Synthesis with Compressed Representations

discuss: https://t.co/wm3SUGzxaF

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.

Ran Xu @stanleyran

almost 2 years ago

RT @SFResearch: Breaking news! ➡️➡️➡️ We just released the MINT-1T 🍃dataset! One trillion tokens. Multimodal. Interleaved. Open-source. Pe…

Ran Xu @stanleyran

about 2 years ago

Releasing the first of a new series of BLIP - #BLIP3, more to come!

stanleyran retweeted

Caiming Xiong

@CaimingXiong

over 2 years ago

Excited to share our brand new LLM evaluation benchmark 🐠FoFo🐠 on format-following!

🐠FOFO🐠 is a pioneering benchmark for evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats, a crucial yet under-examined capability for their application as AI agents.

Link: https://t.co/qBETnrar8r

Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings:
1. open-source models significantly lag behind closed-source ones in format adherence;
2. LLMs’ format-following performance is independent of their content generation quality;
3. LLMs’ format proficiency varies across different domains.

These observations suggest two key points:
i) The format-following capacity of LLMs appears independent of their content-following capacity shown in AlpacaEval and MT-Bench, and may necessitate specialized alignment fine-tuning beyond the conventional instruction-tuning of open source LLMs.
ii) Format-following capacity is not universally transferable across domains, highlighting the potential utility of our benchmark as a guiding and probing tool for selecting domain-specific AI agent foundation models.
