Introducing 🔥GPA🔥: GUI Process Automation from Salesforce AI Research
Project page: https://t.co/b3LGmmmaRp
Demo: https://t.co/fUHrqkf2w2
Submitting receipts, logging customer meeting notes, navigating clunky enterprise UIs… all on autopilot.
GPA learns directly from you by watching how you do a task once — mouse clicks, keyboard inputs, everything.
No brittle scripts. No manual setup. GPA is the next-gen RPA.
▶️ Introducing GPA: GUI Process Automation from Salesforce AI Research
🧑💻 Technical Blog and Demo: https://t.co/oO61VnHoFc
📎 Paper: https://t.co/Qmg6v3FnRq
Record one workflow demo. Replay it automatically — deterministic, fully local, and free.
Why does this matter? Most GUI agents send your screenshots to the cloud, burn tokens on every click, and still guess wrong 10% of the time. A new framework for RPA, GPA takes a different approach.
In pilot testing against Gemini 3 Pro's computer-use agent, GPA achieved 100% success rate at ~10× faster execution across 16 desktop GUI tasks.
No prompt engineering. No cloud calls. No randomness.
#EfficientAI #EnterpriseAI #FutureofAI
Deep research agents typically scale depth—more sequential steps. But what about scaling width? 🤔
📄 Paper: https://t.co/TLP3YBEHUZ
We introduce Wide & Deep (W&D) research agents: a framework exploring parallel tool calling to boost performance while reducing costs and latency.
Key results on BrowseComp, HLE, and GAIA:
📊 Parallel tool calling improves accuracy across GPT-5, Gemini, and Claude
💰 36% reduction in API costs, 41% reduction in wall-clock time
🎯 W&D with GPT-5-Medium achieves 62.2% on BrowseComp, beating GPT-5-High's 54.9%
Why it works:
🔍 Enhanced source credibility through diverse information gathering
✅ Tool result verification catches unreliable outputs
🧩 Query decomposition improves retrieval effectiveness
We also tested tool call schedulers. A "descending" strategy—explore early, exploit later—added another ~6% gain. 📈
Unlike complex multi-agent orchestration, W&D uses intrinsic parallel tool calling within a single reasoning step, making it easy to integrate into existing agent frameworks.
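To make the idea of parallel tool calls within one reasoning step concrete, here is a minimal sketch in Python. It is not the W&D implementation: the `search` tool, its latency, and the `wide_step` helper are hypothetical stand-ins, and the only thing it illustrates is issuing several tool calls concurrently and collecting the results in order.

```python
import asyncio


async def search(query: str) -> str:
    """Hypothetical tool: stands in for a real search or browse call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"results for {query!r}"


async def run_step(queries: list[str]) -> list[str]:
    """Issue every tool call for one reasoning step concurrently."""
    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(search(q) for q in queries))


def wide_step(queries: list[str]) -> list[str]:
    """Synchronous wrapper so the step can be called from ordinary code."""
    return asyncio.run(run_step(queries))
```

Because the calls run concurrently, the step takes roughly as long as the slowest call rather than the sum of all calls. That is one plausible reading of where a wall-clock saving could come from.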
🌐 Website / Code: https://t.co/FajxSDV5XY
Authors: Xiaoqiang Lin @xiaoqiang_98, Jun Hao Liew @jhliew91, Silvio Savarese @silviocinguetta, and Junnan Li @LiJunnan0409 at @Salesforce AI Research.
#FutureOfAI #EnterpriseAI #AIAgents
We introduce 🔍Wide & Deep (W&D) research agents: scale width by making more parallel tool calls per turn.
Scaling width boosts accuracy on BrowseComp, HLE, and GAIA — while cutting turns, API cost, and wall-clock time.
A simple descending scheduler (explore early → exploit later) adds another ~6% gain.
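As an illustration of what a descending schedule could look like, the sketch below shrinks the number of parallel tool calls linearly over the turns. The linear decay, the function name `descending_width`, and the default widths are assumptions made for this example; the schedule in the paper may differ.

```python
def descending_width(turn: int, total_turns: int,
                     max_width: int = 8, min_width: int = 1) -> int:
    """Number of parallel tool calls to allow at a given turn (0-indexed).

    Starts wide (explore) and narrows linearly toward min_width (exploit).
    All parameter values here are illustrative assumptions.
    """
    if total_turns <= 1:
        return max_width
    frac = turn / (total_turns - 1)  # 0.0 at the first turn, 1.0 at the last
    return round(max_width - frac * (max_width - min_width))


# Width allowed at each turn of a hypothetical 5-turn run.
schedule = [descending_width(t, 5) for t in range(5)]
```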
Our W&D agent with GPT-5-medium hits 62.2% on BrowseComp, beating GPT-5-high deep research (54.9%).
📄 Paper: https://t.co/swJNgBrELO
🌐 Website: https://t.co/oveD9ycBIM
💻 Code: https://t.co/Lsux9PqDqK
Great work led by @xiaoqiang_98 and @jhliew91 at @SFResearch!
Accurate time-series forecasting isn’t just about past numbers anymore.
Real-world signals like external events, anomalies, and future changes matter.
@SFResearch is excited to introduce MoiraiAgent — an agentic, context-aware time-series forecasting framework that reasons over data and context to deliver more robust predictions!
🚀 SOTA on GIFT-Eval and GIFT-CTX
🧠 Dynamic expert selection
📝 Multimodal context integration
Read the blog 👇
https://t.co/eQHPB1iuNQ
Code: https://t.co/VYix1AR9Yd
After a year of teamwork, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀
Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video.
In pursuit of minimal modeling, DA3 reveals two key insights:
💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture.
✨ A single depth-ray representation is enough. No complex 3D tasks.
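One way to read a depth-ray representation is sketched below, assuming each pixel carries a ray origin, a ray direction, and a depth along that ray. The function name `unproject` and the tuple layout are hypothetical. This shows the geometry only, not DA3's actual code.

```python
def unproject(origin, direction, depth):
    """Return the 3D point origin + depth * direction for one pixel.

    origin and direction are (x, y, z) tuples. Treating depth as a distance
    along the ray is an assumption of this sketch.
    """
    return tuple(o + depth * d for o, d in zip(origin, direction))


# A pixel whose ray starts at the camera center and points along +z.
point = unproject((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.5)
```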
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series.
The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen.
👇(1/n)
#DepthAnything3
Introducing Seaweed APT2, a real-time, interactive, streaming video generation model.
https://t.co/dBT7uQoFxz
Adversarial training for autoregressive modeling!
Streaming 1-minute videos, 1 diffusion step, 24 fps real-time on 1×H100, with interactive controls!
We present PeRFlow which accelerates diffusion models via piecewise rectified flow. PeRFlow has several amazing features:
1) fast generation and supporting negative prompts for prompt engineering;
2) superior compatibility to various SD pipelines.
Teaser video:
PeRFlow
Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
PeRFlow trains piecewise-linear rectified flow models for fast sampling. These models can be initialized from pretrained diffusion models, such as Stable Diffusion (SD).
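The sampling idea behind a piecewise-linear flow can be sketched as one Euler step per time window. The code below is a toy under stated assumptions, not PeRFlow's implementation: `toy_velocity` is a made-up velocity field that is constant inside each window, time runs from 0 to 1, and the state is a single float rather than an image latent.

```python
from typing import Callable, Sequence


def sample_piecewise(x0: float,
                     velocity: Callable[[float, float], float],
                     boundaries: Sequence[float]) -> float:
    """Integrate dx/dt = velocity(x, t) with one Euler step per window."""
    x = x0
    for t, t_next in zip(boundaries[:-1], boundaries[1:]):
        # If the trajectory is a straight line inside the window,
        # a single Euler step across the window is exact.
        x = x + (t_next - t) * velocity(x, t)
    return x


def toy_velocity(x: float, t: float) -> float:
    """Hypothetical field: constant within each of two time windows."""
    return 2.0 if t < 0.5 else -1.0


# Two windows, so sampling takes two steps.
x_final = sample_piecewise(0.0, toy_velocity, [0.0, 0.5, 1.0])
```

With a velocity that really is constant inside each window, the number of sampling steps equals the number of windows. This is the intuition for why straightening the trajectory piece by piece allows few-step generation.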
ChatGPT can now see, hear, and speak. Rolling out over the next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms).
https://t.co/uNZjgbR5Bm
An AI-based social media app is coming.
Kristen Garcia Dumont (ex-Machine Zone CEO) has founded a new social media app called BeFake, to redefine social media.
The app lets users snap fantasy versions of themselves using AI-generated images.
More details:
-It allows users to express creativity beyond just selfies by submitting text prompts to generate visuals.
-CEO Kristen Garcia Dumont sees it as more authentic self-expression versus the pressure of real pictures.
-The most creative faux identities gain traction in the app's community, with users able to share images from prompts and react to their favorites.
The founder of BeFake is no joke, with some of the top-grossing mobile games globally under her belt.
I'll be watching this app closely as it might play a pivotal role in the widespread integration of AI into social media.
What do you think?
My fun weekend hack: llama2.c 🦙🤠
https://t.co/CUoF0l07oX
Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU.
Midjourney-to-3D is now available!
(Indeed, you can convert any 2D image to 3D)
https://t.co/ii0y904taL
Please note that, for now, you'll need to manually set a depth map for your image in order to view its correct 3D version.
(You can use https://t.co/Lf17Q9hPOG to generate the depth map)
But... stay tuned! We'll soon be offering instant 3D conversion for any image.
*This is just an experimental version, and any feedback or videos of your results would be really interesting to see!
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
paper page: https://t.co/OnwSz7Cl61
LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of inputs, thus only constructing a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when it generates a response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned).
Finally, we are releasing the code for "3D Gaussian Splatting for Novel View Synthesis" that won the #SIGGRAPH2023 best paper award. This is a huge milestone, and we made a huge effort to provide clean code and reproducible results.
https://t.co/59EwpMgKEd
Here is the DragGAN Face Inversion @Gradio demo. You can upload your image and experiment with some wild edits. Please be patient, as the inversion training process takes approximately 2 minutes 😞
https://t.co/vQ0TIIlrZm
zeroscope_v2 XL, a watermark-free Modelscope-based video model capable of generating high-quality video at 1024x576
Model on @huggingface : https://t.co/aFbEO6oydm
This model was trained with offset noise using 9,923 clips and 29,769 tagged frames at 24 frames, 1024x576 resolution. zeroscope_v2_XL is specifically designed for upscaling content made with zeroscope_v2_576w using vid2vid in the 1111 text2video extension by kabachuha. Leveraging this model as an upscaler allows for superior overall compositions at higher resolutions, permitting faster exploration in 576x320 (or 448x256) before transitioning to a high-resolution render. zeroscope_v2_XL uses 15.3 GB of VRAM when rendering 30 frames at 1024x576.