Arena.ai

Verified account

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring →

US

Joined March 2023

214 Following

163.8K Followers

3.2K Posts

Pinned Tweet

1 day ago

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

32

379

43

105

141K

about 1 hour ago

Dive into the Text-to-Image Arena leaderboard and filter by open models to see the results and data at: https://t.co/G1IeZKsywZ

0

4

0

0

1K

about 1 hour ago

In the Image Arena: open-weight Text-to-Image has a clear leader, with a tight race directly behind it:

- #1 Ideogram-4.0 Quality has set the pace this week with a score of 1204. @ideogram_ai
- #2 Hunyuan Image 3.0 by @TencentHunyuan with a score of 1151, just 1 point ahead of Flux-2 Dev @bfl_ai at #3.
- #4 Qwen Image 2512 by @Alibaba_Qwen and #5 HiDream-O1 Image @HiDream_AI complete the top five, scoring 1128 and 1124.

The top six come from different labs, while Flux and Qwen provide the greatest depth across the Top 15.

6

66

3

11

5K

about 3 hours ago

Get your real-world work done with the help of agents, and help measure the advancement of agentic AI along the way: https://t.co/8ujN06t7FN

0

5

2

0

1K

about 3 hours ago

Mistral 3.5 by @MistralAI has been added to Arena's new Agent Mode! Put models to work on your most complex real-world tasks, and see how they perform. Your sessions will help shape the Agent Arena leaderboard.

https://t.co/5D6I9Xj0pS

3

67

7

15

5K

about 3 hours ago

Check out who’s on the Agent Arena leaderboard so far: https://t.co/5PhJhhhUYI

2

6

1

2

2K

about 4 hours ago

Dive into the Text-to-Image Arena leaderboard details, and filter for the data points that matter most to you at: https://t.co/G1IeZKsywZ

0

5

0

0

2K

about 4 hours ago

Three new models entered the Image Arena Top 10 this past month (Text-to-Image):

- #2 Reve 2.0 by @Reve (1,273), behind only GPT Image 2.
- #4 MAI-Image-2.5 by @MicrosoftAI (1,253).
- #9 Ideogram 4.0 Quality by @Ideogram_ai (1,204), the only open-weights model in the Top 10.

Reve 2.0 and MAI-Image-2.5 displaced their own predecessors: both previous generations dropped out of the Top 10.

9

154

10

28

6K

about 4 hours ago

The three new entries bring different strengths across the Text-to-Image categories:

- Reve 2.0 has the broadest profile, leading the three models in six of eight categories. Its clearest strengths are Text Rendering, Commercial Design, and Photorealistic Imagery.
- MAI-Image-2.5 leads in 3D Imaging and Art, while remaining competitive across the other categories.
- Ideogram 4.0 Quality's strongest relative results are in overall performance and Text Rendering.

1

12

1

2

2K

arena retweeted

1 day ago

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn by turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

62

1K

135

303

316K

about 7 hours ago

Start evaluating agentic AI on Arena today with Agent Mode at: https://t.co/8ujN06t7FN

0

5

0

0

2K

about 7 hours ago

Agentic AI is now evaluated in the Arena with Agent Mode and measured with Agent Arena. Founding Engineer Matt and Product Lead Ted show you Agent Mode in action: deep research, complex bash operations, whatever you throw at it. Every session contributes to the Agent Arena leaderboard.

00:00 What is Agent Mode
00:16 The task: explain a research paper PDF
00:38 Watching the agent work
01:47 The workspace panel
02:13 Exploring the generated site
03:18 Voting on agent tasks
03:54 Follow-up: explain like I'm five
04:58 How voting feeds the leaderboard

6

76

6

15

7K

about 7 hours ago

Dive into the Agent Arena leaderboard and see how agentic models perform in aggregate and across 5 different signals:

- Confirmed Success
- Praise vs Complaint
- Steerability
- Bash Recovery
- Tool Hallucination

https://t.co/5PhJhhhUYI

2

14

2

1

2K

about 22 hours ago

Start getting your real-world work done with the help of agents in Agent Mode at: https://t.co/8ujN06t7FN

1

9

1

1

3K

about 22 hours ago

Nemotron 3 Ultra has been added to the new Agent Mode! This latest model from @NVIDIA and other top frontier models are ready for your complex, multi-step tasks. Your sessions will help shape the new Agent Arena leaderboard.

https://t.co/532FH1qqm6

7

104

7

12

11K

about 22 hours ago

Learn more about Nemotron 3 Ultra https://t.co/OFqPmuMBaJ

1 day ago

Today we're shipping Nemotron 3 Ultra. A 550B MoE frontier-intelligence open model built for long-running agents. It delivers 5x faster inference and lowers the cost of complex agentic tasks by up to 30% versus other open frontier models.

167

3K

424

1K

1M

1

16

0

1

6K

1 day ago

As we launch Agent Mode on Arena today, we want to celebrate the community that brought us here. Battle Mode - where it all started - just passed 50 million votes. Thank you.

5

77

5

6

8K

1 day ago

The full Agent Arena Leaderboard is here: https://t.co/5PhJhhhUYI

6

20

2

4

4K

1 day ago

Check out our technical blog for the Agent Arena methodology + a deep dive into how people delegate, correct, and steer agents: https://t.co/uKso7j00H3

2

20

2

3

6K
