Thibault LSDC @tlsdc_ - Twitter Profile

tlsdc_ retweeted

about 1 year ago

1/ How do we evaluate agent vulnerabilities in situ, in dynamic environments, under realistic threat models? We present 🔥 DoomArena 🔥 — a plug-in framework for grounded security testing of AI agents. ✨Project : https://t.co/yOsZize8V1 📝Paper: https://t.co/jjEnJu9Vf6

8

38

16

9

7K

tlsdc_ retweeted

Thibault LSDC @tlsdc_

over 1 year ago

@Swarooprm7 AgentLab/BrowserGym brings together MiniWoB, WorkArena, WebArena, VisualWebArena, WebLINX, and AssistantBench in a single codebase—making real-world, agentic evaluations seamless and efficient! 🚀 https://t.co/7O1wzGzmTZ

0

4

1

0

54

tlsdc_ retweeted

Massimo Caccia @MassCaccia

over 1 year ago

@Swarooprm7 Can't beat WorkArena :) https://t.co/LSiKVOinQO

1

5

1

0

172

Thibault LSDC @tlsdc_

over 1 year ago

@Swarooprm7 AgentLab/BrowserGym brings together MiniWoB, WorkArena, WebArena, VisualWebArena, WebLINX, and AssistantBench in a single codebase—making real-world, agentic evaluations seamless and efficient! 🚀 https://t.co/7O1wzGzmTZ

Alexandre Lacoste @alex_lacoste_

over 1 year ago

🧵-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena. https://t.co/byutypu9Le

4

154

59

84

28K

0

4

1

0

54

Thibault LSDC @tlsdc_

over 1 year ago

🔍 Try it out here: https://t.co/ds49SsJhlH

0

2

0

71

Thibault LSDC @tlsdc_

over 1 year ago

🚀 We recently released on HuggingFace a demo of AgentXray, our tool for analyzing web agent traces! Built on AgentLab & BrowserGym, it provides in-depth insights into web agent behaviors for research & benchmarking. Link below!

#WebAgents #AgentLab #BrowserGym #HuggingFace https://t.co/7HBXnoRaRe

1

10

5

1

1K

tlsdc_ retweeted

Léo Boisvert @LeoBoisvert

over 1 year ago

📊 Fresh WorkArena benchmark results just dropped!
Plot twist: o1-mini (51.8%) > o3-mini (48.2%)
Either o1-mini had its coffee this morning ☕️ or we've stumbled upon something interesting 🧐
Replication studies welcome! https://t.co/JWZT83i07N

1

12

6

0

3K

tlsdc_ retweeted

Léo Boisvert @LeoBoisvert

over 1 year ago

🔥 Fresh off the GPU, new WorkArena-L1 results are in! 🔥
Llama 3.3 70B: 34.5% (↑6.6% from 3.1) Qwen 2.5 32B: 27.9%
Even the small models shine: Qwen 2.5 7B (8.2%) doubles Llama 3.1 8B (4%)!
☕️ These models are working harder than me on a Monday morning ☕️ https://t.co/d2Jx8soY80

1

22

10

2

2K

tlsdc_ retweeted

Alexandre Lacoste @alex_lacoste_

over 1 year ago

If you're at #NeurIPS2024 or in Vancouver, please join us for this happy hour event about web agents.

0

15

3

1

1K

tlsdc_ retweeted

Massimo Caccia @MassCaccia

over 1 year ago

Following last week's release of AgentLab, here's our thorough analysis of your most popular LLM web agents on your favorite web agent benchmarks! Hope you enjoy :) If you are at @NeurIPSConf, come chat with us tomorrow at our co-hosted Happy Hour on WebAgent development!
📅 Date: Dec 13th, 6:00pm
📍 Location: 15-minute walk from NeurIPS (see details after RSVP)
🎉 RSVP here: https://t.co/6AbSJgtD76

0

13

4

1

1K

tlsdc_ retweeted

Alexandre Lacoste @alex_lacoste_

over 1 year ago

We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.

In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet https://t.co/nLXTFiA2Ch

3

105

31

29

14K

tlsdc_ retweeted

Alexandre Lacoste @alex_lacoste_

over 1 year ago

🧵-1 We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.

4

154

59

84

28K
