1/ How do we evaluate agent vulnerabilities in situ, in dynamic environments, under realistic threat models?
We present 🔥 DoomArena 🔥 — a plug-in framework for grounded security testing of AI agents.
✨ Project: https://t.co/yOsZize8V1
📝Paper: https://t.co/jjEnJu9Vf6
@Swarooprm7 AgentLab/BrowserGym brings together MiniWoB, WorkArena, WebArena, VisualWebArena, WebLINX, and AssistantBench in a single codebase—making real-world, agentic evaluations seamless and efficient! 🚀
https://t.co/7O1wzGzmTZ
🧵-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.
🚀 We recently released on HuggingFace a demo of AgentXray, our tool for analyzing web agent traces! Built on AgentLab & BrowserGym, it provides in-depth insights into web agent behaviors for research & benchmarking. Link below!
#WebAgents #AgentLab #BrowserGym #HuggingFace
📊 Fresh WorkArena benchmark results just dropped!
Plot twist: o1-mini (51.8%) > o3-mini (48.2%)
Either o1-mini had its coffee this morning ☕️ or we've stumbled upon something interesting 🧐
Replication studies welcome!
🔥 Fresh off the GPU, new WorkArena-L1 results are in! 🔥
Llama 3.3 70B: 34.5% (↑6.6% from 3.1)
Qwen 2.5 32B: 27.9%
Even the small models shine: Qwen 2.5 7B (8.2%) doubles Llama 3.1 8B (4%)!
☕️ These models are working harder than me on a Monday morning ☕️
Following last week's release of AgentLab, here's our thorough analysis of your most popular LLM web agents on your favorite web agent benchmarks! Hope you enjoy :)
If you are at @NeurIPSConf, come chat with us tmr at our co-hosted Happy Hour on WebAgent development!
📅 Date: Dec 13th 6:00pm
📍 Location: 15-minute walk from NeurIPS; see details after RSVP
🎉 RSVP Here: https://t.co/6AbSJgtD76
We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.
In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also present some unexpected results from Claude 3.5 Sonnet.