An annoyingly common question I get as an AI PhD student is “When can you get the ChatGPT AI to do something useful? It can’t even work on my phone yet. Siri is pretty dumb.”
To be fair, I think their criticism is correct. I personally wish AI were better integrated on my phone. LLMs can solve IMO problems, so shouldn’t it be a cakewalk for them to remind me of the text I forgot to respond to last week? Apparently not, since nothing like that exists in my pocket yet. Or maybe Apple’s new update yesterday fixed this and my research project is obsolete.
We are releasing iOSWorld (https://t.co/oo4AXDdUId), a dynamic iPhone benchmark with 26 newly created apps grounded in personal context. All 26 apps are seeded around a single persona, Jordan Avery, and they interact in a realistic ecosystem that reflects real app interactions. We created 133 personalized mobile agent tasks to test in this environment, and the best model, even with privileged information, scores only 51%.
Congrats to the Webwright team https://t.co/mmpl4tO0p4 at @MSFTResearch for taking the #1 spot on Odysseys, a highly challenging benchmark for long-horizon web agents:
https://t.co/rj5BHK5g6C
Odysseys evaluates realistic, multi-hour web workflows that require sustained planning, memory, reasoning, and verification across many websites and tools. These are far beyond short single-step browser tasks.
For example, if you are searching for CS faculty positions, a single task could involve building a comprehensive Excel tracker of openings across the top CS schools using CSRankings as the master checklist; verifying every school directly through department, engineering, and university careers pages for CS/AI/ML/data science/robotics/vision faculty roles; opening and validating each posting; maintaining structured evidence and verification tabs; and finishing with a completeness audit and summary of hiring trends.
Exciting progress toward truly capable long-horizon web agents.
We're releasing Fara1.5-9B, a very capable browser-use model that feels like a step change in small CUA model capability, achieving 63% on OnlineM2W auto-eval. We've put in a lot of work to make it useful for all types of web tasks.
https://t.co/PCPyF5jmWh
The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵
@JosephD @ms_aifrontiers Many web tasks sit behind a website’s JS interaction layer, so search APIs alone aren’t enough. The good news is that we find agents don’t need to mimic human browsing click by click in a single browser session. We think a terminal + code workspace is a better abstraction :)
@stevienipz @ms_aifrontiers Good question! Similar spirit, but different focus. Claude Code is a general coding agent; Webwright is a minimal harness for long-horizon, JS-heavy web tasks where web fetch APIs fail.
🚀 Ready to turn your favorite coding models into state-of-the-art browser agents, especially for long-running tasks and writing RPA scripts?
🔥 Meet Webwright: our first SWE-style browser agent framework for web tasks. We show that a terminal is all you need to deliver SOTA performance.
🤖 Easy integration: plug Webwright skills into Claude Code, Codex, and OpenClaw, so it can be your most reliable and robust personal web agent.
📝 Blog: https://t.co/UXo5YxJzh0…
💻 Code: https://t.co/feztu1dLsY
🌐 Project Page: https://t.co/Cx9Ilcuijj
🏄 Web agent browsing history becomes code: Webwright requires each task to be completed end-to-end with code files, the way human engineers write Robotic Process Automation (RPA) code. Instead of fragile click traces, the written RPA script makes future similar tasks far more efficient.
🎯 Strong performance: 60.8% on the long-horizon web benchmark Odysseys with GPT-5.4 (a significant improvement over the previous vision-based SOTA of 44.5%!), a 26.6-point improvement over vision-based GPT-5.4 (33.5%). 86.7% on Online-Mind2Web, the highest auto-eval score at 100 steps.
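To make the "browsing history becomes code" point concrete, here is a minimal sketch of script reuse. It is not Webwright's actual API: `ScriptLibrary`, `solve`, and `explore_flight_search` are hypothetical names, and the "exploration" is a stand-in for an agent session that ends by writing a parameterized script.

```python
from typing import Callable, Dict

# A saved script is just a parameterized callable (an assumption for this sketch).
Script = Callable[..., str]


class ScriptLibrary:
    """Hypothetical store of scripts written during earlier agent sessions."""

    def __init__(self) -> None:
        self._scripts: Dict[str, Script] = {}
        self.explorations = 0  # how many times the costly exploration ran

    def solve(self, kind: str, explore: Callable[[], Script], **params: str) -> str:
        # Explore only when no script exists yet for this kind of task.
        if kind not in self._scripts:
            self.explorations += 1
            self._scripts[kind] = explore()
        # Otherwise (and afterwards) just run the saved script with new parameters.
        return self._scripts[kind](**params)


def explore_flight_search() -> Script:
    # Stand-in for an agent session whose end product is a reusable script.
    def script(origin: str, dest: str) -> str:
        return f"searched flights {origin}->{dest}"

    return script


library = ScriptLibrary()
first = library.solve("flight_search", explore_flight_search, origin="SEA", dest="SFO")
second = library.solve("flight_search", explore_flight_search, origin="SEA", dest="JFK")
```

In this toy setup the second, similar task skips exploration entirely and only reruns the saved script with different parameters, which is the efficiency gain the post describes.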
@JangLawrenceK @kohjingyu Thanks @JangLawrenceK, fortunate to have a great benchmark like Odysseys!
Seems like, in the short term, a hybrid approach makes the most sense—where a pure vision-based CUA can step in whenever a coding-focused CUA struggles to complete a task.
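A rough sketch of what that hybrid routing could look like, assuming both agents expose the same simple interface. `AgentResult`, `run_hybrid`, and the two stub agents are hypothetical and only illustrate the fallback order, not any released system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    success: bool
    output: Optional[str] = None
    agent: str = ""


# Hypothetical agent interface: task description in, result out.
Agent = Callable[[str], AgentResult]


def run_hybrid(task: str, coding_agent: Agent, vision_agent: Agent) -> AgentResult:
    """Try the coding-focused agent first; hand off to the vision agent if it fails."""
    try:
        result = coding_agent(task)
    except Exception:
        # Treat a crash the same as an unsuccessful attempt.
        result = AgentResult(success=False, agent="coding")
    if result.success:
        return result
    return vision_agent(task)


def stub_coding_agent(task: str) -> AgentResult:
    # Stand-in: pretend the coding agent struggles with canvas-only pages.
    if "canvas" in task:
        return AgentResult(success=False, agent="coding")
    return AgentResult(success=True, output=f"script for: {task}", agent="coding")


def stub_vision_agent(task: str) -> AgentResult:
    # Stand-in: pretend the vision agent always clicks its way through.
    return AgentResult(success=True, output=f"clicked through: {task}", agent="vision")


easy = run_hybrid("fill a form", stub_coding_agent, stub_vision_agent)
hard = run_hybrid("draw on a canvas", stub_coding_agent, stub_vision_agent)
```

Here `easy` is handled by the coding agent and `hard` falls through to the vision agent. A real system would also need a reliable way to detect that the coding agent has actually failed, which this sketch takes for granted.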
Browser use 🌐, code executors 💻, and various apps 📊 all supercharge LLMs. But what if one tool could do it all? Introducing OmniTool: a full Windows VM 🖥️ unlocking the true power of LLM agents. No extra infra for each tool, just limitless possibilities. 🚀 #DeepResearch #Microsoft
🤖 Your LLM agent only needs 1 tool – an operating system. Introducing OmniTool from Microsoft Research. Use any app in Windows by pairing OmniParser V2 with your favourite LLM (GPT4o, O1, DeepSeek R1 or Qwen 2.5VL).
Out of the box, it supports the following large language models: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Claude, and Qwen (2.5VL).
Discover more here:
📝 Blog: OmniParser V2: Turning Any LLM into a Computer Use Agent - Microsoft Research
💻 Code: microsoft/OmniParser