Arena.ai

Verified account

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring →

US

Joined March 2023

214 Following

163.7K Followers

3.2K Posts

Pinned Tweet

about 22 hours ago

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

31

353

38

102

116K

about 14 hours ago

Start getting your real-world work done with the help of agents in Agent Mode at: https://t.co/8ujN06t7FN

1

9

1

1

3K

about 14 hours ago

Nemotron 3 Ultra has been added to the new Agent Mode!

This latest model from @NVIDIA and other top frontier models are ready for your complex, multi-step tasks. Your sessions will help shape the new Agent Arena leaderboard. https://t.co/532FH1qqm6

about 22 hours ago

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

50

970

122

265

276K

7

93

6

11

10K

about 14 hours ago

Learn more about Nemotron 3 Ultra https://t.co/OFqPmuMBaJ

1 day ago

Today we're shipping Nemotron 3 Ultra. A 550B MoE frontier-intelligence open model built for long-running agents. It delivers 5x faster inference and lowers the cost of complex agentic tasks by up to 30% versus other open frontier models.

155

3K

385

1K

912K

1

16

0

1

5K

about 16 hours ago

As we launch Agent Mode on Arena today, we want to celebrate the community that brought us here. Battle Mode - where it all started - just passed 50 million votes. Thank you.

4

76

5

6

7K

about 22 hours ago

The full Agent Arena Leaderboard is here: https://t.co/5PhJhhhUYI

5

20

2

4

4K

about 22 hours ago

Check out our technical blog for the Agent Arena methodology + a deep dive into how people delegate, correct, and steer agents: https://t.co/uKso7j00H3

1

20

2

3

5K

about 22 hours ago

Start evaluating agentic AI on Arena today with Agent Mode at: https://t.co/8ujN06t7FN

0

21

1

2

3K

about 22 hours ago

Read more about Agent Mode, dig into the FAQ, and get a preview of what we've learned so far on our blog at: https://t.co/8yYkONtbbi

1

24

0

1

3K

arena retweeted

1 day ago

Grok Imagine 1.5 at rank 1

2K

12K

3K

418

2M

arena retweeted

1 day ago

we made a new model for text-to-image generation and editing. the results are looking good and the leaderboard is looking strong. it turns out that nano banana 2 is not impossible to beat, which felt like the case at the beginning of the year.

there are a lot of great models out there that get released often. why should you care about reve 2.0? to me, there are mainly two reasons. one being that reve is an underdog, reasonably funded but magnitudes less than other big labs, e.g. oai, google, meta, etc. you might be curious about how we managed to make it to the top. two being that reve 2.0 is a decent model, and we as a team are willing to talk openly about some of our learnings and thoughts that could be helpful. in this post, i want to share mine on reve 2.0 and multimodal in general as a person working on it.

first things first, reve 2.0 is a pixel diffusion model with a thing that we call "layout" as the rendering representation. these two things are our research bets that turned out to work amazingly well. pixel diffusion lets us go 4k without sacrificing quality or speed. layout lets us scale better and have better control, which are two sides of the same coin. the field standard has been to use long upsampled prompts for rendering. yet this results in an awkward situation where captioners and users need to describe precise controls with text, which can be inaccurate. this inaccuracy amounts to bad reconstruction and control at test time. it gets worse with scale. and this inherent ambiguity is a curse in current multimodal generators.

so what's a layout? a layout is a css of an image, which can be either defined by humans or learned by models. we end up capitalizing a lot on regions, which are good for 2D space. yet this idea naturally generalizes. it turns out to be a standard VLM mid-training task, and that's solvable in good hands. it also brings many good properties in pretraining and post-training, which i am not going to expand on. ideogram independently verified that layout is useful (released on the same day, congrats!). to be clear, these bets are not novel, but to put together a system that makes them work is (and showing it beats nano banana 2).

second, it's nice that these bets, among others, worked out. however, like in many cases, there was a long time when things were underperforming. our competitor models are great, and most likely didn't make many risky bets. it is a big pipelining and engineering problem. why should we risk it? in retrospect, the culture of our team and leadership helped a lot. our priorities didn't swing and have stayed focused during our development. the idea makes sense, the execution is good, if things don't work out it's a bug, let's go find it and try more things. by and large, reve remains a research lab with big computers. this is rare. let me tag some amazing ppl here: @Taesung @m_gharbi @Songwei_Ge @TianweiY James Hong @dima_smirnov_ @theSidlak, ... the list goes on.

third, we spent most of our time improving text-to-image and didn't do much on editing. and our arena ranks show that. to date, we are #2 on text-to-image yet #9 on image editing. it's honestly a bit embarrassing that we didn't do well in editing, as layout promises to do well. but i am confident that this will improve, as we are juggling bandwidth and resources (we are a small team, and hey, come join us!).

fourth, talking about leaderboards and the state of multimodal, i genuinely feel that the gap between labs is shrinking. compared to LLMs, multimodal gen is at least half a year to a year behind. i am talking about architectures and core pipelines. to do good multimodal, you need to do good LLMs. reve has been helped by the OSS community a lot, but we've realized we need to own our language stack. and scaling follows naturally. leaderboards, in turn, are a noisy approximation and average of the real environments that you care about in deployment. they chase scaling and generalizable post-training. reve 2.0 ended up not being driven much by leaderboard evaluation, but relying on our intuition instead.

finally, how can multimodal be more useful? this is a question that keeps me up at night. coding has found its product-market fit and is driving up societal productivity. how can multimodal do that too? to me, we are nailing a single-round rollout that leads to an infinite one. this infinite rollout will drive our digital interaction and creation. for this rollout to be good, it needs to be precise. otherwise rollout efficiency is too low for either humans or agents. we are making bets and concrete progress towards that goal, such as converting images into a css-like layout. if you are interested in this topic, i recommend @stuffyokodraws's post for a high-level digest: https://t.co/mg42EZkd2h. the success of multimodal depends on whether or not it can find a good product-market fit. that's the top question to figure out, then it's the model. it's quite non-linear to be honest, as critical pieces are still missing. but to me it's an area worth pouring my thoughts and efforts into.

give our model a spin, try your tasks, move some boxes. in case you find any bugs, please let me know in a reply or DM. hope it can help you.

15

322

21

113

1M

1 day ago

Dive into all the leaderboard details across arenas at: https://t.co/PjWOaDEXWR

0

6

1

1

3K

1 day ago

MiniMax M3 has landed in the Arena and has moved the Pareto frontier!

Their latest model ranks #7 in Code Arena: Frontend. Scoring 1531, it is neck and neck with GLM-5.1. It moves the Pareto frontier in its price class at $0.60 input/$2.40 output per million tokens.

Congrats to the @MiniMax_AI team on this achievement!
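
For readers unfamiliar with the term, a model sits on the price/score Pareto frontier when no other model is both cheaper and higher-scoring. The following is a minimal sketch of that idea, not Arena's actual method: the `Entry` type, the `pareto_frontier` helper, the model names, and all prices and scores are hypothetical placeholders, and a single per-token price is assumed for simplicity.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    price: float  # hypothetical cost per million tokens, in USD
    score: float  # hypothetical leaderboard score


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Return the entries that no other entry beats on both price and score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, higher score first.
    for e in sorted(entries, key=lambda e: (e.price, -e.score)):
        # Keep an entry only if it outscores every cheaper (or equal-price) one.
        if e.score > best_score:
            frontier.append(e)
            best_score = e.score
    return frontier


# Placeholder data, not real leaderboard numbers.
entries = [
    Entry("model-a", 0.50, 1400),
    Entry("model-b", 1.00, 1380),  # dominated by model-a
    Entry("model-c", 2.00, 1500),
    Entry("model-d", 5.00, 1550),
    Entry("model-e", 6.00, 1540),  # dominated by model-d
]

frontier_names = [e.name for e in pareto_frontier(entries)]
print(frontier_names)

# A new model "moves the frontier" if it appears in the recomputed frontier.
updated = entries + [Entry("model-new", 0.80, 1450)]
updated_names = [e.name for e in pareto_frontier(updated)]
print(updated_names)
```

In this sketch, adding `model-new` changes the frontier because nothing cheaper scores higher than it, which is the sense in which a new release can shift the frontier within its price class.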

MiniMax (official) @MiniMax_AI

4 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

539

9K

1K

3K

4M

30

550

35

73

54K

1 day ago

MiniMax M3 also ranks #14 in the Document Arena, where models are ranked for their capabilities in document analysis and long-content reasoning. For its price point, it shifts the Pareto frontier here as well. https://t.co/FEae9azpbN

2

17

0

2

5K
