Jun Hao Liew @jhliew91 - Twitter Profile

jhliew91 retweeted

3 months ago

Introducing 🔥GPA🔥: GUI Process Automation from Salesforce AI Research Project page: https://t.co/b3LGmmmaRp Demo: https://t.co/fUHrqkf2w2 Submitting receipts, logging customer meeting notes, navigating clunky enterprise UIs… all on autopilot. GPA learns directly from you by watching how you do a task once — mouse clicks, keyboard inputs, everything. No brittle scripts. No manual setup. GPA is the next-gen RPA.

2

8

3

1

1K

jhliew91 retweeted

Salesforce AI Research

@SFResearch

3 months ago

▶️ Introducing GPA: GUI Process Automation from Salesforce AI Research

🧑‍💻 Technical Blog and Demo: https://t.co/oO61VnHoFc
📎 Paper: https://t.co/Qmg6v3FnRq

Record one workflow demo. Replay it automatically — deterministic, fully local, and free.

Why does this matter? Most GUI agents send your screenshots to the cloud, burn tokens on every click, and still guess wrong 10% of the time. A new framework for RPA, GPA takes a different approach.

In pilot testing against Gemini 3 Pro's computer-use agent, GPA achieved 100% success rate at ~10× faster execution across 16 desktop GUI tasks.

No prompt engineering. No cloud calls. No randomness.

#EfficientAI #EnterpriseAI #FutureofAI

1

18

5

6

2K

jhliew91 retweeted

Salesforce AI Research

@SFResearch

4 months ago

Deep research agents typically scale depth—more sequential steps. But what about scaling width? 🤔 📄 Paper: https://t.co/TLP3YBEHUZ We introduce Wide & Deep (W&D) research agents: a framework exploring parallel tool calling to boost performance while reducing costs and latency. Key results on BrowseComp, HLE, and GAIA: 📊 Parallel tool calling improves accuracy across GPT-5, Gemini, and Claude 💰 36% reduction in API costs, 41% reduction in wall-clock time 🎯 W&D with GPT-5-Medium achieves 62.2% on BrowseComp—beating GPT-5-High's 54.9% Why it works: 🔍 Enhanced source credibility through diverse information gathering ✅ Tool result verification catches unreliable outputs 🧩 Query decomposition improves retrieval effectiveness We also tested tool call schedulers. A "descending" strategy—explore early, exploit later—added another ~6% gain. 📈 Unlike complex multi-agent orchestration, W&D uses intrinsic parallel tool calling within a single reasoning step, making it easy to integrate into existing agent frameworks. 🌐 Website / Code: https://t.co/FajxSDV5XY Authors: Xiaoqiang Lin @xiaoqiang_98, Jun Hao Liew @jhliew91, Silvio Savarese @silviocinguetta, and Junnan Li @LiJunnan0409 at @Salesforce AI Research. #FutureOfAI #EnterpriseAI #AIAgents

5

22

7

13

2K
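
The W&D tweet above describes two ideas: issuing several tool calls in parallel within a single turn, and a "descending" scheduler that explores widely early and narrows later. The sketch below is one possible reading of that description, not the authors' implementation. The linear schedule, the function names (`descending_widths`, `fake_search`, `run_agent`), and the stub tool are all hypothetical and chosen only for illustration.

```python
import asyncio


def descending_widths(max_width: int, min_width: int, turns: int) -> list[int]:
    """Hypothetical schedule: shrink the number of parallel calls linearly per turn."""
    if turns == 1:
        return [max_width]
    step = (max_width - min_width) / (turns - 1)
    return [round(max_width - step * t) for t in range(turns)]


async def fake_search(query: str) -> str:
    """Stand-in for a real tool call (assumption: tools are I/O-bound coroutines)."""
    await asyncio.sleep(0.01)
    return f"result for {query!r}"


async def run_agent(question: str, widths: list[int]) -> list[list[str]]:
    """Run one turn per scheduled width, issuing that many tool calls concurrently."""
    history = []
    for turn, width in enumerate(widths):
        queries = [f"{question} / turn {turn} / sub-query {i}" for i in range(width)]
        # All tool calls of a turn are awaited together, so the turn takes roughly
        # as long as its slowest call rather than the sum of all calls.
        results = await asyncio.gather(*(fake_search(q) for q in queries))
        history.append(list(results))
    return history


widths = descending_widths(max_width=4, min_width=1, turns=4)
history = asyncio.run(run_agent("example question", widths))
```

In a real agent the sub-queries would come from the model's own query decomposition rather than a fixed template, and the schedule would likely be tuned per benchmark.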

jhliew91 retweeted

Li Junnan

@LiJunnan0409

4 months ago

We introduce 🔍Wide & Deep (W&D) research agents: scale width by making more parallel tool calls per turn.

Scaling width boosts accuracy on BrowseComp, HLE, and GAIA — while cutting turns, API cost, and wall-clock time.

A simple descending scheduler (explore early → exploit later) adds another ~6% gain.

Our W&D agent with GPT-5-medium hits 62.2% on BrowseComp, beating GPT-5-high deep research (54.9%).

📄 Paper: https://t.co/swJNgBrELO
🌐 Website: https://t.co/oveD9ycBIM
💻 Code: https://t.co/Lsux9PqDqK

Great work led by @xiaoqiang_98 and @jhliew91 at @SFResearch!

1

49

8

30

4K

jhliew91 retweeted

Li Junnan

@LiJunnan0409

5 months ago

Accurate time-series forecasting isn’t just about past numbers anymore. Real-world signals like external events, anomalies, and future changes matter. @SFResearch is excited to introduce MoiraiAgent — an agentic, context-aware time-series forecasting framework that reasons over data and context to deliver more robust predictions! 🚀 SOTA on GIFT-Eval and GIFT-CTX 🧠 Dynamic expert selection 📝 Multimodal context integration Read the blog 👇 https://t.co/eQHPB1iuNQ Code: https://t.co/VYix1AR9Yd

3

10

2

1

1K

jhliew91 retweeted

Bingyi Kang

@bingyikang

7 months ago

After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3 reveals two key insights: 💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture. ✨ A single depth-ray representation is enough. No complex 3D tasks. Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen. 👇(1/n) #DepthAnything3

79

4K

492

2K

514K

jhliew91 retweeted

Peter Lin

@peter9863

about 1 year ago

Introducing Seaweed APT2, a real-time, interactive, streaming video generation model. https://t.co/dBT7uQoFxz Adversarial training for autoregressive modeling! Streaming 1 minute videos, 1 diffusion step, 24fps real-time on 1xh100, with interactive controls!

14

194

36

129

20K

jhliew91 retweeted

Hanshu Yan

@hanshu_yan

over 2 years ago

We present PeRFlow which accelerates diffusion models via piecewise rectified flow. PeRFlow has several amazing features: 1) fast generation and supporting negative prompts for prompt engineering; 2) superior compatibility to various SD pipelines. Teaser video:

2

52

11

21

27K

jhliew91 retweeted

AK

@_akhaliq

over 2 years ago

PeRFlow Piecewise Rectified Flow as Universal Plug-and-Play Accelerator PeRFlow trains piecewise-linear rectified flow models for fast sampling. These models can be initialized from pretrained diffusion models, such as Stable Diffusion (SD).

1

53

11

27

11K

jhliew91 retweeted

OpenAI

@OpenAI

over 2 years ago

ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm

2K

38K

9K

4K

11M

jhliew91 retweeted

Rowan Cheung

@rowancheung

almost 3 years ago

An AI-based social media app is coming.

Kristen Garcia Dumont (ex-Machine Zone CEO) has founded a new social media app called BeFake, to redefine social media. The app lets users snap fantasy versions of themselves using AI-generated images.

More details:
-It allows users to express creativity beyond just selfies by submitting text prompts to generate visuals.
-CEO Kristen Garcia Dumont sees it as more authentic self-expression versus the pressure of real pictures.
-The most creative faux identities gain traction in the app's community, with users able to share images from prompts and react to their favorites.

The founder of BeFake is no joke, with some of the top-grossing mobile games globally under her belt.

I'll be watching this app closely as it might play a pivotal role in the widespread integration of AI into social media.

What do you think?

107

826

149

486

350K

jhliew91 retweeted

Andrej Karpathy

@karpathy

almost 3 years ago

My fun weekend hack: llama2.c 🦙🤠
https://t.co/CUoF0l07oX
Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU. https://t.co/aBvKCf1t2u

89

5K

690

2K

1M

jhliew91 retweeted

Cristian Peñas ░░░░░░░░

@ilumine_ai

almost 3 years ago

Midjourney-to-3D is now available! (Indeed, you can convert any 2D image to 3D) https://t.co/ii0y904taL Please note that, for now, you'll need to manually set a depth map for your image in order to view its correct 3D version. (You can use https://t.co/Lf17Q9hPOG to generate the depth map) But... stay tuned! We'll soon be offering instant 3D conversion for any image. *This is just an experimental version, and any feedback or videos of your results would be really interesting to see!

67

3K

531

2K

445K

jhliew91 retweeted

AK

@_akhaliq

almost 3 years ago

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs paper page: https://t.co/OnwSz7Cl61 LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the usage of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of inputs, thus only constructing a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenario of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image, when it is generating response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and find corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during the interaction with human. It performs consistently well when provided by arbitrary modality combinations (either aligned or unaligned).

2

141

46

67

23K

jhliew91 retweeted

George Kopanas

@gkopanas

almost 3 years ago

Finally we are releasing the code for "3D Gaussian Splatting for Novel View Synthesis" that won the #SIGGRAPH2023 best paper award. This is a huge milestone and we did a huge effort to provide clean code and reproducible results. https://t.co/59EwpMgKEd

3

471

78

143

49K

jhliew91 retweeted

Radamés Ajna

@radamar

almost 3 years ago

Here is the DragGAN Face Inversion @Gradio demo. You can upload your image and experiment with some wild edits. Please be patient, as the inversion training process takes approximately 2 minutes 😞 https://t.co/vQ0TIIlrZm

6

706

154

332

114K

jhliew91 retweeted

AK

@_akhaliq

almost 3 years ago

Unity Introduces Unity Muse and Unity Sentis, AI-powered creativity blog: https://t.co/6hMK21ryb1

4

354

91

96

65K

jhliew91 retweeted

AK

@_akhaliq

almost 3 years ago

midjourney version 5.2 zoom out feature: Unleashing the Potential of A Broader View

49

3K

443

396

502K

jhliew91 retweeted

AK

@_akhaliq

almost 3 years ago

zeroscope_v2 XL, A watermark-free Modelscope-based video model capable of generating high quality video at 1024 x 576 Model on @huggingface : https://t.co/aFbEO6oydm This model was trained with offset noise using 9,923 clips and 29,769 tagged frames at 24 frames, 1024x576 resolution. zeroscope_v2_XL is specifically designed for upscaling content made with zeroscope_v2_576w using vid2vid in the 1111 text2video extension by kabachuha. Leveraging this model as an upscaler allows for superior overall compositions at higher resolutions, permitting faster exploration in 576x320 (or 448x256) before transitioning to a high-resolution render. zeroscope_v2_XL uses 15.3gb of vram when rendering 30 frames at 1024x576