Adam Sadovsky @asadovsky - Twitter Profile

Adam Sadovsky @asadovsky

about 3 hours ago

New evals are nice because they show who’s really ahead & pushing the frontier versus who’s chasing.

Arena.ai

@arena

about 18 hours ago

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

Adam Sadovsky @asadovsky

2 days ago

Today we announced MAI-Thinking-1, a strong generalist and reasoning LLM built from the ground up without distilling third-party models. 97% on AIME 2025; 53% on SWE-Bench Pro; preferred by human raters over Sonnet 4.6 (blind side-by-side). Tech report: https://t.co/qxGQWX5cOt

asadovsky retweeted

Microsoft AI

@MicrosoftAI

3 days ago

Seven new models launching at Build: let’s go!
Reasoning. Code. Image. Transcribe. Voice.

Built from scratch on a clean data lineage, designed for efficiency, working seamlessly as a family of models

Thread 🧵
#MSBuild https://t.co/g3WQIcIQ24

asadovsky retweeted

echen

@echen

3 days ago

big congrats to the microsoft AI team on MAI-Thinking-1!

this is the kind of thoughtful post-training the field needs more of - focused on what actually matters to users

excited to see a new frontier model in the race 😎

https://t.co/c4WxjGQBUk https://t.co/A7ADZl3Yfa

Adam Sadovsky @asadovsky

7 months ago

@dustinvtran Nice work!!

asadovsky retweeted

Mustafa Suleyman

@mustafasuleyman

8 months ago

Meet our third @MicrosoftAI model: MAI-Image-1
#9 on LMArena, striking an impressive balance of generation speed and quality
Excited to keep refining + climbing the leaderboard from here!
We're just getting started.
https://t.co/33BiNfIjPg https://t.co/FMaXqiVIvS

asadovsky retweeted

Nando de Freitas

@NandoDF

9 months ago

This was an amazing week at @MicrosoftAI!! We released MAI 1 preview and a taste of MAI Voice. I’m super happy with this team - only about 100 people and already shipping in @lmarena_ai in less than a year. Strong support. More soon. Thanks for feedback!

Adam Sadovsky @asadovsky

9 months ago

hello, world!

Arena.ai

@arena

9 months ago

🚨Text Leaderboard Update:

A new model provider, @MicrosoftAI has broken into the Top 15 this week!

💠MAI-1-preview by @MicrosoftAI debuts at #13.

Congrats to the Microsoft AI team! As the Text Arena is one of the most competitive races, breaking into the Top 15 is no small feat. 💪

Adam Sadovsky @asadovsky

about 1 year ago

Interesting

Florian Brand

@xeophon

about 1 year ago

Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself. META did not do a specific deployment / system prompt just for LMsys, did they? 👀

Adam Sadovsky @asadovsky

about 1 year ago

SOTA just got way cheaper

Adam Sadovsky @asadovsky

about 1 year ago

Quite interesting to see how some models generalize dramatically better!

Mislav Balunović @mbalunovic

about 1 year ago

Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial amount of points (24.4%). The speed of progress is really mind-blowing. https://t.co/k6ePaqBbpy

asadovsky retweeted

Martin Baeuml

@mbaeuml

about 1 year ago

Just shipped a few updates:
1. Gemini 2.5 Pro to try for free on https://t.co/LG1SUiRPBX in the model drop down. Advanced has higher limits.
2. Canvas with 2.5 Pro in Advanced. Our best coding model yet.

We had so much fun building demos internally, can't wait to see what y'all come up with!

Adam Sadovsky @asadovsky

about 1 year ago

Gemini 2.5 Pro is SOTA on pretty much everything

Silas Alberti

@silasalberti

about 1 year ago

Wow we just ran Gemini 2.5 Pro on our evals and it got a new state of the art. Congrats to the Gemini team!

Sharing preliminary results here and working on bringing it into Devin: https://t.co/4Wjl5wxqB7

asadovsky retweeted

Bindu Reddy

@bindureddy

about 1 year ago

WE HAVE A NEW BEST MODEL IN THE WORLD! GEMINI 2.5 IS #1 ON LIVEBENCH

asadovsky retweeted

Arena.ai

@arena

about 1 year ago

BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇

Adam Sadovsky @asadovsky

about 1 year ago

cook or die

asadovsky retweeted

Kyle Corbitt

@corbtt

about 1 year ago

If you're fine-tuning LLMs, Gemma 3 is the new 👑 and it's not close. Gemma 3 trounces Qwen/Llama models at every size!
- Gemma 3 4B beats 7B/8B competition
- Gemma 3 27B matches 70B competition

Vision benchmarks coming soon! https://t.co/hDv83DLUvA

Adam Sadovsky @asadovsky

about 1 year ago

Wow, quite impressive for a 27B model!

Arena.ai

@arena

about 1 year ago

🎉 Congrats to @GoogleDeepMind on Gemma-3-27B, the newest and one of the strongest open models in Arena!

💠 Top 10 overall - beating out many proprietary models with only 27B parameters
💠 2nd best open model only below DeepSeek-R1
💠 128K context window

Check out their blog to learn more about Gemma 3. We can't wait to see where this goes next! 🔥👏

asadovsky retweeted

Subhash Choudhary

@subhashchy

over 1 year ago

We replaced GPT-4o with Gemini-2.0 Flash for Bot9, reducing our costs by about 20× with no visible loss in accuracy.

This change was implemented on a highly complex support agent that makes 32 tool calls.

I was seriously not expecting this.

At the application layer, it also made us one of the top 10 apps built with Gemini worldwide, and the only one from India in the list.

- Data source: OpenRouter.

asadovsky retweeted

Farzad Mostashari @Farzad_MD

over 1 year ago

1/ After residency at Mass General Hospital, I reported to Atlanta to meet my fellow CDC Epidemic Intelligence Service Officers. I have never felt so intimidated by my peers. The best and the brightest, they were star clinicians, had served in disaster zones; MD/PhDs and MSF.
