Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.
In the Image Arena: open-weight Text-to-Image has a clear leader, with a tight race directly behind it:
- #1 Ideogram-4.0 Quality has set the pace this week with a score of 1204. @ideogram_ai
- #2 Hunyuan Image 3.0 by @TencentHunyuan with a score of 1151, just +1 pt ahead of Flux-2 Dev @bfl_ai at #3.
- #4 Qwen Image 2512 by @Alibaba_Qwen and #5 HiDream-O1 Image @HiDream_AI complete the top five, scoring 1128 and 1124.
The top six are represented by different labs, while Flux and Qwen provide the greatest depth across the Top 15.
Mistral 3.5 by @MistralAI has been added to Arena's new Agent Mode!
Put models to work on your most complex real-world tasks, and see how they perform.
Your sessions will help shape the Agent Arena leaderboard.
Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.
Three new models entered the Image Arena Top 10 this past month (Text-to-Image):
- #2 Reve 2.0 by @Reve (1,273), behind only GPT Image 2.
- #4 MAI-Image-2.5 by @MicrosoftAI (1,253).
- #9 Ideogram 4.0 Quality by @Ideogram_ai enters at #9 (1,204). And the only open-weights model in the top 10.
Reve 2.0 and MAI-Image-2.5 displaced their own predecessors, as both previous generations dropped out of the Top 10 with these improvements.
The three new entries bring different strengths across the Text-to-Image categories:
- Reve 2.0 has the broadest profile, leading the three models in six of eight categories. Its clearest strengths are Text rendering, Commercial Design and Photorealistic Imagery.
- MAI Image 2.5 leads in 3D Imaging and Art, while remaining competitive across the other categories.
- Ideogram 4.0 Quality’s strongest relative results are in overall performance and Text Rendering.
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Agentic AI is now evaluated in the Arena with Agent Mode and measured with Agent Arena.
Founding Engineer Matt and Product Lead Ted show you Agent Mode in action: deep research, complex bash operations, whatever you throw at it. Every session contributes to the Agent Arena leaderboard.
00:00 What is Agent Mode
00:16 The task: explain a research paper PDF
00:38 Watching the agent work
01:47 The workspace panel
02:13 Exploring the generated site
03:18 Voting on agent tasks
03:54 Follow-up: explain like I'm five
04:58 How voting feeds the leaderboard
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Dive into the Agent Arena leaderboard and see how agentic models perform in aggregate and across 5 different signals:
- Confirmed Success
- Praise vs Complaint
- Steerability
- Bash Recovery
- Tool Hallucination
https://t.co/5PhJhhhUYI
Nemotron 3 Ultra has been added to the new Agent Mode!
This latest model from @NVIDIA and other top frontier models are ready for your complex, multi-step tasks. Your sessions will help shape the new Agent Arena leaderboard.
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Today we're shipping Nemotron 3 Ultra.
A 550B MoE frontier-intelligence open model built for long-running agents.
It delivers 5x faster inference and lowers the cost of complex agentic tasks by up to 30% versus other open frontier models.
As we launch Agent Mode on Arena today, we want to celebrate the community that brought us here.
Battle Mode - where it all started - just passed 50 million votes.
Thank you.
Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.
Check out our technical blog for the Agent Arena methodology + a deep dive into how people delegate, correct, and steer agents: https://t.co/uKso7j00H3