Adam Lu @Adamlu28 - Twitter Profile

Adamlu28 retweeted

@JangLawrenceK

14 days ago

An annoyingly common question I get as an AI PhD student is “When can you get the ChatGPT AI to do something useful? It can’t even work on my phone yet. Siri is pretty dumb.”

To be fair, I think their criticism is correct. I personally wished that AI was better integrated on my phone. LLMs can solve IMO problems, so shouldn’t it be a cakewalk for it to remind me of the text I forgot to respond to last week? Obviously not, since it doesn’t exist in my pocket yet. Or maybe Apple’s new update yesterday fixed this and my research project is obsolete.

We are releasing iOSWorld (https://t.co/oo4AXDdUId), a dynamic iPhone benchmark with 26 newly created apps grounded in personal context. Each of the 26 apps is centrally seeded around one persona, Jordan Avery, and the apps interact together in a realistic ecosystem that reflects real app interactions. We create 133 personalized mobile agent tasks to test in this environment, and the best model, even with privileged information, only scores 51%.

4

35

17

10

16K

Adamlu28 retweeted

Russ Salakhutdinov

@rsalakhu

about 1 month ago

Congrats to the Webwright team https://t.co/mmpl4tO0p4 at @MSFTResearch for taking the #1 spot on Odysseys, a highly challenging benchmark for long-horizon web agents: https://t.co/rj5BHK5g6C Odysseys evaluates realistic, multi-hour web workflows that require sustained planning, memory, reasoning, and verification across many websites and tools. These are far beyond short single-step browser tasks. For example, if you are searching for CS faculty positions, a single task could involve building a comprehensive Excel tracker of openings across the top CS schools using CSRankings as the master checklist; verifying every school directly through department, engineering, and university careers pages for CS/AI/ML/data science/robotics/vision faculty roles; opening and validating each posting; maintaining structured evidence and verification tabs; and finishing with a completeness audit and summary of hiring trends. Exciting progress toward truly capable long-horizon web agents.

1

34

8

28

5K

Adamlu28 retweeted

Dimitris Papailiopoulos

@DimitrisPapail

28 days ago

https://t.co/JX0uyS9na9

0

23

2

15

2K

Adamlu28 retweeted

Hussein Mozannar @HsseinMzannar

about 1 month ago

We're releasing a very capable browser-use model, Fara1.5-9B, that feels like a step change in small CUA model capability, achieving 63% on OnlineM2W auto-eval. We've put in a lot of work to make it useful for all types of web tasks. https://t.co/PCPyF5jmWh

https://t.co/PCPyF5jmWh https://t.co/mII9bTQGDr

4

38

10

18

4K

Adamlu28 retweeted

Kilian Lieret @KLieret

about 1 month ago

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵 https://t.co/ukMHUXYs1O

35

1K

120

299

245K

Adam Lu @Adamlu28

about 1 month ago

@JosephD @ms_aifrontiers Many web tasks are behind a website’s JS interaction layer, so search APIs alone aren’t enough. The good news is that we find agents don’t need to mimic human browsing click by click in a single browser session; we think a terminal + code workspace is a better abstraction :)

0

2

0

16

Adam Lu @Adamlu28

about 1 month ago

@stevienipz @ms_aifrontiers Good question! Similar spirit, but different focus. Claude Code is a general coding agent; Webwright is a minimal harness for long-horizon and JS-heavy web tasks where web fetch APIs fail.

0

1

0

61

Adam Lu @Adamlu28

about 2 months ago

Great collaboration with @Xu_Lingrui_ @huang_chao4969 @AhmedHAwadallah!

0

34

Adam Lu @Adamlu28

about 2 months ago

🚀 Ready to turn your favorite coding models into state-of-the-art browser agents, especially for long-running tasks and writing RPA scripts? 🔥 Meet Webwright, our first SWE-style browser agent framework for web tasks. We show that a terminal is all you need to deliver SOTA performance.

4

0

2

79

Adam Lu @Adamlu28

about 2 months ago

🤖 Easy integration: Webwright skills integrate easily with Claude Code, Codex, and OpenClaw, so it can be your most reliable and robust personal web agent. 📝 Blog: https://t.co/UXo5YxJzh0… 💻 Code: https://t.co/feztu1dLsY 🌐 Project Page: https://t.co/Cx9Ilcuijj

0

47

Adam Lu @Adamlu28

about 2 months ago

🏄 Web agent browsing history becomes code: Webwright requires each task to be completed end-to-end with code files, the way human engineers write Robotic Process Automation code. Instead of fragile click traces, the written RPA script makes similar future tasks far more efficient.

0

1

0

26

Adam Lu @Adamlu28

about 2 months ago

🎯 Strong performance: 60.8% on the long-horizon web benchmark Odysseys with GPT-5.4 (a significant improvement over the previous vision-based SOTA of 44.5%!), a 26.6-point improvement over vision-based GPT-5.4 (33.5%). 86.7% on Online-Mind2Web, the highest autoeval score at 100 steps.

0

19

Adam Lu @Adamlu28

about 2 months ago

@JangLawrenceK @kohjingyu Thanks @JangLawrenceK, fortunate to have a great benchmark like Odysseys! It seems that, in the short term, a hybrid approach makes the most sense, where a pure vision-based CUA can step in whenever a coding-focused CUA struggles to complete a task.

0

1

0

43

Adam Lu @Adamlu28

over 1 year ago

Browser use🌐, code executors💻, and various apps📊 all supercharge LLMs. But what if one tool could do it all? Introducing OmniTool: a full Windows VM 🖥️ unlocking the true power of LLM agents. No extra infra for each tool, just limitless possibilities. 🚀 #DeepResearch #Microsoft

Thomas Dhome-Casanova @swayingoak

over 1 year ago

🤖 Your LLM agent only needs 1 tool – an operating system. Introducing OmniTool from Microsoft Research. Use any app in Windows by pairing OmniParser V2 with your favourite LLM (GPT4o, O1, DeepSeek R1 or Qwen 2.5VL).

1

13

7

3K

0

1

0

121

Adam Lu @Adamlu28

over 1 year ago

🚀Huggingface Model checkpoint: https://t.co/JeBBDqNk1U… 📷MSR Blog Post: https://t.co/Dl9BMZNGqz… 📷https://t.co/uXvcKc1Ams… 📷Video demo for OmniTool:

0

2

0

108

Adam Lu @Adamlu28

over 1 year ago

Ready to transform your favorite reasoning large language models into AI agents that can seamlessly operate across PC, mobile, and web platforms?

1

14

5

3

1K

Adam Lu @Adamlu28

over 1 year ago

Out of the box, it supports the following large language models: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Claude, Qwen (2.5VL). Discover more here: 📝 Blog: OmniParser V2: Turning Any LLM into a Computer Use Agent - Microsoft Research 💻 Code: microsoft/OmniParser

2

0

155

Adam Lu @Adamlu28

over 1 year ago

Model: microsoft/OmniParser-v2.0 · Hugging Face Great collaboration with: @swayingoak @jw2yang4ai Ahmed Awadallah

0

2

0

110
