Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Today we announced MAI-Thinking-1, a strong generalist and reasoning LLM built from the ground up without distilling third-party models. 97% on AIME 2025; 53% on SWE-Bench Pro; preferred by human raters over Sonnet 4.6 (blind side-by-side).
Tech report: https://t.co/qxGQWX5cOt
Seven new models launching at Build: let’s go!
Reasoning. Code. Image. Transcribe. Voice.
Built from scratch on a clean data lineage, designed for efficiency, working seamlessly as a family of models
Thread 🧵
#MSBuild
big congrats to the microsoft AI team on MAI-Thinking-1!
this is the kind of thoughtful post-training the field needs more of - focused on what actually matters to users
excited to see a new frontier model in the race 😎
https://t.co/c4WxjGQBUk
Meet our third @MicrosoftAI model: MAI-Image-1
#9 on LMArena, striking an impressive balance of generation speed and quality
Excited to keep refining + climbing the leaderboard from here!
We're just getting started.
https://t.co/33BiNfIjPg
This was an amazing week at @MicrosoftAI !! We released MAI 1 preview and a taste of MAI Voice. I’m super happy with this team - only about 100 people and already shipping in @lmarena_ai in less than a year. Strong support. More soon. Thanks for feedback!
🚨Text Leaderboard Update:
A new model provider, @MicrosoftAI has broken into the Top 15 this week!
💠MAI-1-preview by @MicrosoftAI debuts at #13.
Congrats to the Microsoft AI team! As the Text Arena is one of the most competitive races, breaking into the Top 15 is no small feat. 💪
Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself
Meta did not do a specific deployment / system prompt just for LMsys, did they? 👀
Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial amount of points (24.4%). The speed of progress is really mind-blowing.
Just shipped a few updates
1. Gemini 2.5 Pro to try for free on https://t.co/LG1SUiRPBX in the model drop down. Advanced has higher limits.
2. Canvas with 2.5 Pro in Advanced. Our best coding model yet. We had so much fun building demos internally, can't wait to see what y'all come up with!
Wow we just ran Gemini 2.5 Pro on our evals and it got a new state of the art. Congrats to the Gemini team!
Sharing preliminary results here and working on bringing it into Devin:
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆
Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!
Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌
More highlights in thread👇
If you're fine-tuning LLMs, Gemma 3 is the new 👑 and it's not close. Gemma 3 trounces Qwen/Llama models at every size!
- Gemma 3 4B beats 7B/8B competition
- Gemma 3 27B matches 70B competition
Vision benchmarks coming soon!
🎉 Congrats to @GoogleDeepMind on Gemma-3-27B, the newest and one of the strongest open models in Arena!
💠 Top 10 overall - beating out many proprietary models with only 27B parameters
💠 2nd best open model only below DeepSeek-R1
💠 128K context window
Check out their blog to learn more about Gemma 3. We can't wait to see where this goes next! 🔥👏
We replaced GPT-4o with Gemini-2.0 Flash for Bot9, reducing our costs by about 20× with no visible loss in accuracy.
This change was implemented on a highly complex support agent that makes 32 tool calls.
I was seriously not expecting this.
At the application layer, it also made us one of the top 10 apps built with Gemini worldwide — and the only one from India in the list.
- Data source: OpenRouter.
1/ After residency at Mass General Hospital, I reported to Atlanta to meet my fellow CDC Epidemic Intelligence Service Officers.
I have never felt so intimidated by my peers
The best and the brightest, they were star clinicians, had served in disaster zones; MD/PhDs and MSF.