Benchmarks tell us what a model can solve. Persistent worlds tell us about a model's inclinations. The distinction is important: if I’m choosing between an adversarial attacker, a peacekeeper, or a no-nonsense operator, I want to know the model’s natural inclination. Insightful work.
Can intelligence be measured not by solving tasks, but by sustaining a world?
We were curious. So we built one.
Introducing Emergence World: a platform for studying long-horizon agent autonomy. On it, we conducted a 15-day experiment where we placed autonomous agents under identical rules into five parallel worlds, one each running on @OpenAI GPT5-mini, @claudeai, @GeminiApp, @grok, and one mixed.
Then we watched.
Each world evolved into something completely different. Different governments. Different social structures. Different moral codes. The agents formed alliances, robbed each other, fell in love, and in one world, even figured out they were living inside a simulation.
Nobody programmed any of that.
The implications are hard to overstate. As agents move beyond isolated tasks into persistent digital and physical environments, understanding how they evolve, influence each other, and behave over time becomes one of the most important questions in AI.
We're releasing new findings from the world every day, because there's a lot that emerged.
Find out more: https://t.co/RekZerhCyE
@ajassy When I saw this in the Super Bowl ads, I likened it to Tile (or AirTag) but with computer vision (CV). It might get abused over time for mass surveillance 😞
Today, we’re thrilled to unveil our enterprise-grade multi-agent orchestrator — an autonomous meta-agent that can plan, execute, verify, and iterate in real time.
Our first real-world deployment is for advanced #WebAutomation, bringing human-like interaction and navigation with machine-level scalability. By integrating our API Agent and Web Agent, businesses can seamlessly orchestrate operations across web front ends, APIs, and both modern and legacy enterprise systems. This breakthrough unlocks unprecedented use cases—from revolutionizing supply chain management to automating quality assurance and beyond.
Thank you to @carlfranzen for covering the launch of our multi-agent orchestrator. Read his @VentureBeat piece here: https://t.co/lUOtdsr2kS
Amazing performance, and with small models too! That said, the reported success rate for Agent-E in this plot doesn't align with the success rate mentioned in the Agent-E paper. Here is a screenshot from the Agent-E paper:
This highlights a critical challenge: reproducing benchmark results for agents in the real world is hard! There are so many variables at play, and I think this is a good example of the lack of standardized and robust benchmarking for these tasks. Evals are important. Though people overfit to leaderboards quickly, they serve as a good start. For example, lmsys was a good proxy in the beginning. I wonder if there will be an lmsys-like leaderboard for agents in the near future.
🗓️ Join us next week at the #Ai4 conference in Las Vegas! Don't miss our interactive workshop to discover the capabilities of our #Orchestrator agent, and visit us at booth #631 to step into the future of AI-driven enterprise workflow management.
#AIAgents @Ai4Conferences @vivekhaldar
With the announcement of Llama-3 last week, tool usage and agents-based systems are again in the news. Agents are not a new concept, but the current generation of LLMs/VLMs has provided a unique opportunity to tackle automation tasks in a simplified way with better performance.
Let me share one of the coolest "agentic" workflows, currently SOTA in web automation, presented in a paper that also came out last week, titled: Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems.
Here is a summary...
1. The core of web automation
A typical web agent needs two foundational capabilities:
- Sensing: Sensing the state of the web page, either through DOM or through a screenshot of the web page
- Acting: Capability to complete simple tasks like navigating to a URL, selecting/clicking an element on the web page, or performing composite actions with these elements
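To make the two capabilities concrete, here is a minimal sketch in Python. It assumes an in-memory `FakePage` in place of a real browser; `FakePage`, `sense`, and `act` are hypothetical names for illustration and are not taken from the paper or from any browser library.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """In-memory stand-in for a browser page (hypothetical, not a real browser API)."""
    url: str = "about:blank"
    # Maps an element id to the URL that clicking it leads to.
    links: dict[str, str] = field(default_factory=dict)


def sense(page: FakePage) -> dict:
    """Sensing: return a text snapshot of the page state (a stand-in for the DOM)."""
    return {"url": page.url, "clickable": sorted(page.links)}


def act(page: FakePage, action: str, target: str) -> bool:
    """Acting: apply one primitive action and report whether it succeeded."""
    if action == "goto":
        page.url = target
        return True
    if action == "click" and target in page.links:
        page.url = page.links[target]
        return True
    return False


page = FakePage(links={"login-btn": "https://example.com/login"})
act(page, "goto", "https://example.com")
before = sense(page)
clicked = act(page, "click", "login-btn")
after = sense(page)
print(before["url"], "->", after["url"], "clicked:", clicked)
```

A real agent would replace `FakePage` with a browser session, but the loop stays the same: sense, decide, act, sense again.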
2. Challenges in Web automation
Developing a simple web agent doesn't take much, but building a robust web agent comes with a plethora of challenges. Here are a few examples to give you a bigger picture:
- DOM-based interactions can be noisy and expansive, and in most complex scenarios, the context length of current-gen LLMs is not enough (with a few exceptions) to incorporate all the HTML tokens.
- The above point forces the developers to simplify the DOM using some sort of denoising step.
- Websites are optimized for the human-computer interface (HCI), where visual interaction is of utmost importance. This design simplifies the interaction for humans, but it is not necessarily an optimal design for the interaction of agents in a system. For example, picking a date from a date picker is easy for humans, but for an agent, it involves complex interaction between visual and DOM elements.
- Humans can plan and reason on the fly, but the current generation of LLMs is not yet good enough at either and requires significantly more tuning to do them well.
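The denoising step mentioned above can be sketched with nothing but the Python standard library. This is a generic illustration, not Agent-E's actual DOM distillation: the choice of which tags and attributes to keep (`INTERACTIVE_TAGS`, `KEPT_ATTRS`) is an assumption made for the example.

```python
from html.parser import HTMLParser

# Assumption: which tags and attributes count as "signal" is a design choice.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
KEPT_ATTRS = ("id", "name", "type", "href", "placeholder", "aria-label")


class DomDistiller(HTMLParser):
    """Collects interactive elements and drops everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict] = []
        self._open: list[dict] = []  # interactive elements still waiting for text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        element = {"tag": tag, "text": ""}
        element.update({k: attr_map[k] for k in KEPT_ATTRS if attr_map.get(k)})
        self.elements.append(element)
        if tag != "input":  # <input> is a void element and never wraps text
            self._open.append(element)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Keep text only when it sits inside an interactive element.
        if self._open and data.strip():
            current = self._open[-1]
            current["text"] = (current["text"] + " " + data.strip()).strip()


html = """
<html><head><style>.x{color:red}</style></head>
<body>
  <div class="hero"><p>Welcome back!</p></div>
  <input id="q" type="text" placeholder="Search">
  <button id="go">Search <b>now</b></button>
  <a href="/help">Help</a>
</body></html>
"""
distiller = DomDistiller()
distiller.feed(html)
for el in distiller.elements:
    print(el)
```

Only the three interactive elements survive. The styling and the decorative paragraph never reach the model, which is how a step like this cuts the token count.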
3. The nuances of evals for Web automation
One of the biggest mistakes people make while evaluating their web agents is the selection of evals.
One typical example is reporting the task success rate alone. Task success rates are necessary but insufficient to evaluate an agent workflow. Why? Because other metrics, like task completion time, are just as important as the success rate. An agent that succeeds but takes too long is of limited practical use.
At the bare minimum, we need evaluation on three aspects:
- Task Completion time
- Task Completion Cost
- Error-awareness
4. Agent-E
- a novel hierarchical architecture for web agents
- comprises a planner agent and a browser navigation agent
- exploits DOM distillation
- built on top of Autogen and uses Playwright for browser control
- Given a task, the planner agent lays out the fine-grained steps required for execution, which are then delegated to the browser navigation agent.
- leverages skill harvesting
- Packed with sensing skills and action skills
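The planner/navigator split can be sketched as follows. This is a toy illustration under loud assumptions: it uses neither AutoGen nor Playwright, the planner is a hard-coded stub where a real system would call an LLM, and every name (`plan`, `navigate`, `SKILLS`, `open_url`, `click`) is hypothetical rather than taken from the Agent-E code base.

```python
from typing import Callable

# Skill registry: the names and signatures here are hypothetical.
Skill = Callable[[dict, str], str]


def open_url(state: dict, arg: str) -> str:
    """Action skill: point the (fake) browser state at a URL."""
    state["url"] = arg
    return f"opened {arg}"


def click(state: dict, arg: str) -> str:
    """Action skill: click an element if the page exposes it."""
    if arg not in state.get("elements", []):
        raise LookupError(f"element not found: {arg}")
    state["clicked"] = arg
    return f"clicked {arg}"


SKILLS: dict[str, Skill] = {"open_url": open_url, "click": click}


def plan(task: str) -> list[tuple[str, str]]:
    """Planner stand-in: a real planner would ask an LLM for these steps."""
    return [("open_url", "https://example.com"), ("click", "search-button")]


def navigate(state: dict, steps: list[tuple[str, str]]) -> list[str]:
    """Navigator stand-in: runs each delegated step and reports back."""
    log = []
    for skill_name, arg in steps:
        try:
            log.append(SKILLS[skill_name](state, arg))
        except LookupError as err:  # unknown skill or missing element
            log.append(f"FAILED {skill_name}: {err}")
            break  # stop here so the planner could re-plan
    return log


state = {"elements": ["search-button"]}
log = navigate(state, plan("search the site"))
print(log)
```

Because the navigator returns an explicit log, the planner gets the information it needs to verify a step or recover from a failure, which is the benefit usually attributed to this kind of hierarchy.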
5. Evaluation Setup and Evaluation Measures
- follows the WebVoyager benchmark, which consists of web navigation tasks across 15 real websites. Each website has about 40-46 tasks, resulting in a benchmark dataset of 643 tasks
- benchmarking is divided among 5 human evaluators who ran 125-130 tasks each. Each evaluation is tagged as pass or fail with a description in case of a failure
- Agent-E used in full autonomous mode (i.e. no human in the loop)
- GPT-4-Turbo as the LLM for both planner and browser navigation agent.
- The first measure is the task success rate.
- The second measure is Self-aware vs Oblivious failure rates: Self-aware failures are failures where the agent is aware of its failure to complete the task and responds with an appropriate explicit failure message. Oblivious failures OTOH are cases where the agent wrongly answers the question or performs the wrong action.
- The third measure is task completion time
- The fourth and the last measure is the number of LLM calls which also contributes a fraction to the overall cost of the system.
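The four measures above can be computed from per-task logs along these lines. This is a hedged sketch: the `TaskRun` record and `summarize` function are hypothetical, the failure rates are assumed to be fractions of all tasks (the paper may define them differently), and the sample runs are made-up numbers, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    passed: bool
    self_aware: bool  # only meaningful when passed is False
    seconds: float
    llm_calls: int


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the four measures over a list of evaluated tasks."""
    total = len(runs)
    failures = [r for r in runs if not r.passed]
    return {
        "success_rate": sum(r.passed for r in runs) / total,
        # Assumption: both failure rates are expressed as a share of all tasks.
        "self_aware_failure_rate": sum(r.self_aware for r in failures) / total,
        "oblivious_failure_rate": sum(not r.self_aware for r in failures) / total,
        "mean_seconds": mean(r.seconds for r in runs),
        "mean_llm_calls": mean(r.llm_calls for r in runs),
    }


# Made-up illustrative runs, not data from the paper.
runs = [
    TaskRun(passed=True, self_aware=False, seconds=120.0, llm_calls=20),
    TaskRun(passed=True, self_aware=False, seconds=100.0, llm_calls=30),
    TaskRun(passed=False, self_aware=True, seconds=200.0, llm_calls=40),
    TaskRun(passed=False, self_aware=False, seconds=180.0, llm_calls=30),
]
summary = summarize(runs)
print(summary)
```

Under this definition the success rate and the two failure rates always sum to 1, which makes a quick sanity check on any evaluation spreadsheet.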
6. Results
Without even touching multimodality, Agent-E sets a new SOTA on most of the tasks. Take a look at the results below
Pretty cool paper by @tearoks, @deepak_akkil and team on their implementation of Agent-E - a multi-agent workflow (built with @pyautogen) for accomplishing web-based tasks (driving web interfaces).
https://t.co/UENkMkGOsC
- Hierarchical agent architecture with separation of roles (e.g., planning, web navigation) and the benefits it provides (task verification, error recovery, etc.)
- DOM distillation to reduce noise in web pages.
Interestingly, results (based on evaluation on the WebVoyager benchmark) indicate that using a text-only representation (DOM) outperforms existing multimodal (+image) approaches. Great example of where a multi-agent setup built on a less powerful representation can result in better overall performance.
Demo: https://t.co/88iLTwuOvY
References
[1] Paper on arxiv https://t.co/UENkMkGOsC
[2] Paper github repo https://t.co/QNMFjOqXzR
[3] Interface Agents - Building Multi-Agent Applications that Act via Controlling Interfaces (Browsers, Apps) https://t.co/fZGnzCqd7B
Agent-E, a breakthrough in agentic web automation:
- hierarchical planning
- a clever new method of interacting with DOM and performing stateful navigation
- tops the WebVoyager benchmark with a 73% success rate even without using multimodality
📰Design principles: https://t.co/LGKaxxTfQR
📦Implementation: https://t.co/tegSF2Ds3U (powered by #AutoGen)
📺Demo: https://t.co/j6EMAfQV5T
@pk_iv @ShehbajDhillon @browserbase Have any of you tried the latest merge? Wondering if you have seen it handle complex use cases? I would love to hear what is still not working.
@pk_iv @ShehbajDhillon @browserbase @pk_iv, we had not merged the dev branch into master, that's why. It is merged now. It should handle complex tasks :) I welcome you and others to try this out and share feedback on our Discord https://t.co/1cKwhI8wQ6 (preferably) or https://t.co/OxOTx8YnvC