Joan Gamell

@gamell

Software Engineer @arena | Photography | Bon Vivant

California

Joined February 2007

1.1K Following

743 Followers

45 Posts

gamell retweeted

3 days ago

Exciting news: Claude Fable 5 ranks #1 on the new Agent Arena leaderboard! Fable 5 leads by the widest margin ever over Opus-4.8 and GPT-5.5 on two key signals: confirmed task success rate and praise vs. complaint, despite weaker steerability. If Fable can do something, it will do it very well. If it can't/doesn't want to do something, it may be hard to steer the model towards the goal. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks. Models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents. We use the causal tracing methodology to measure a model's net improvement which indicates how much it improves outcomes relative to the average model. Huge congrats to @AnthropicAI for the incredible milestone! Below we break down how Claude Fable 5 (based on Mythos) scored across 5 signals, drawn from tasks submitted by a global community of users.

arena's tweet photo. Exciting news: Claude Fable 5 ranks #1 on the new Agent Arena leaderboard!

Fable 5 leads by the widest margin ever over Opus-4.8 and GPT-5.5 on two key signals: confirmed task success rate and praise vs. complaint, despite weaker steerability. If Fable can do something, it will do it very well. If it can't/doesn't want to do something, it may be hard to steer the model towards the goal.

In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks. Models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.

We use the causal tracing methodology to measure a model's net improvement which indicates how much it improves outcomes relative to the average model.

Huge congrats to @AnthropicAI for the incredible milestone! Below we break down how Claude Fable 5 (based on Mythos) scored across 5 signals, drawn from tasks submitted by a global community of users.

18

515

57

88

372K

gamell retweeted

9 days ago

Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents. Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more. Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents. Top labs in Agent Arena: - #1 @OpenAI: GPT-5.5 (High) - #2 @AnthropicAI: Claude-Opus-4.7 (Thinking) - #3 @Zai_org: GLM-5.1 - #4 @GoogleDeepMind: Gemini-3.1-Pro - #5 @Kimi_Moonshot: Kimi-K2.6 More analysis in the thread, with the full technical blog below.

arena's tweet photo. Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

75

1K

150

332

490K

Joan Gamell @gamell

9 days ago

"Build an Agent testing framework, make no mistakes" Not as easy as it sounds! Congrats team!!

9 days ago

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

43

455

49

127

249K

0

2

0

0

48

gamell retweeted

about 1 month ago

A brief history of me

keyboardJones's tweet photo. A brief history of me https://t.co/WVvSc7DJlN

0

1

1

0

29

Who to follow

@wassieailouros

Wassie Cat, the elusive fridge dweller, has made a cozy home in the depths of the icebox

Having fun with new tech. The future is fractal.

gamell retweeted

about 2 months ago

It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free.

arena's tweet photo. It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free. https://t.co/Z8F6mTtJSS

47

1K

57

141

280K

gamell retweeted

about 2 months ago

Exciting news - DeepSeek V4 Pro is in the Arena with 1.6T parameters (49B activated) alongside V4 Flash at 284B parameters (13B activated). Both support 1M token context. It’s a major leap over DeepSeek V3.2! Code Arena: - DeepSeek V4 Pro (thinking): #3 open model (#14 overall), on par with GPT-5.4-high and Gemini-3.1-Pro in agentic webdev tasks Text Arena: - DeepSeek V4 Pro (thinking): #2 open model (#14 overall), matching Kimi-2.6 - DeepSeek V4 Flash (thinking): #10 open model (#47 overall) Competition at the top of the open model leaderboards keeps heating up. Huge congrats to @DeepSeek_AI on the strong comeback!

arena's tweet photo. Exciting news - DeepSeek V4 Pro is in the Arena with 1.6T parameters (49B activated) alongside V4 Flash at 284B parameters (13B activated). Both support 1M token context. It’s a major leap over DeepSeek V3.2!

Code Arena:
- DeepSeek V4 Pro (thinking): #3 open model (#14 overall), on par with GPT-5.4-high and Gemini-3.1-Pro in agentic webdev tasks

Text Arena:
- DeepSeek V4 Pro (thinking): #2 open model (#14 overall), matching Kimi-2.6
- DeepSeek V4 Flash (thinking): #10 open model (#47 overall)

Competition at the top of the open model leaderboards keeps heating up. Huge congrats to @DeepSeek_AI on the strong comeback!

56

2K

150

306

288K

gamell retweeted

about 2 months ago

Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards! A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date. - #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search aka gemini-3.1-flash-image) - #1 Single-Image Edit (1513), +125 over #2 (Nano-banana-pro aka gemini-3-pro-image) - #1 Multi-Image Edit (1464), +90 over #2 (Nano-banana-2) No model has dominated Image Arena with margins this wide. Huge congratulations to @OpenAI on this major breakthrough in image generation! More performance breakdowns by category in the thread below.

arena's tweet photo. Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards!

A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date.

- #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search aka gemini-3.1-flash-image)
- #1 Single-Image Edit (1513), +125 over #2 (Nano-banana-pro aka gemini-3-pro-image)
- #1 Multi-Image Edit (1464), +90 over #2 (Nano-banana-2)

No model has dominated Image Arena with margins this wide.

Huge congratulations to @OpenAI on this major breakthrough in image generation! More performance breakdowns by category in the thread below.

209

6K

634

1K

2M

gamell retweeted

2 months ago

We’ve added Pareto frontier charts to the leaderboard. Now available across: Text, Vision, Search, Document, and Code Arena. The Pareto frontier curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens (3:1 Ratio). Text Models on the Pareto Frontier curve today: - Claude Opus 4.6 - Gemini 3.1 Pro Preview - Grok 4.20 Beta Reasoning - Gemini 3 Flash - Grok 4.1 Thinking - DeepSeek v3.2 Exp Thinking - Deepseek v3.2 - Mimo v2 Flash (non-thinking) - Gemma-3-27b-it - Gemma-3-12b-it - GPT-OSS-20b - Gemma 3n-e4b-it Model count by Lab for Text Arena: - 5 models @GoogleDeepMind - 2 models from @xai - 2 models from @deepseek_ai - 1 @AnthropicAI - 1 @Xiaomi - 1 @OpenAI Dig into the Pareto frontier for Document, Search, Video or Code Arena, by lab, or the expert categories most relevant to you.

arena's tweet photo. We’ve added Pareto frontier charts to the leaderboard.

Now available across:
Text, Vision, Search, Document, and Code Arena.

The Pareto frontier curve demonstrates which models are most efficient at their level of performance (by Arena score) vs. a blended price per 1M tokens (3:1 Ratio).

Text Models on the Pareto Frontier curve today:
- Claude Opus 4.6
- Gemini 3.1 Pro Preview
- Grok 4.20 Beta Reasoning
- Gemini 3 Flash
- Grok 4.1 Thinking
- DeepSeek v3.2 Exp Thinking
- Deepseek v3.2
- Mimo v2 Flash (non-thinking)
- Gemma-3-27b-it
- Gemma-3-12b-it
- GPT-OSS-20b
- Gemma 3n-e4b-it

Model count by Lab for Text Arena:
- 5 models @GoogleDeepMind
- 2 models from @xai
- 2 models from @deepseek_ai
- 1 @AnthropicAI
- 1 @Xiaomi
- 1 @OpenAI

Dig into the Pareto frontier for Document, Search, Video or Code Arena, by lab, or the expert categories most relevant to you.

14

207

25

53

18K

gamell retweeted

3 months ago

Did you know? We’re funding independent research in AI evaluation and measurement—up to $50k per project. The Q1 deadline to apply for Arena’s Academic Partnerships Program is March 31.

arena's tweet photo. Did you know? We’re funding independent research in AI evaluation and measurement—up to $50k per project.

The Q1 deadline to apply for Arena’s Academic Partnerships Program is March 31. https://t.co/RbDG56ZUDI

1

30

7

19

13K

gamell retweeted

3 months ago

Today we’re launching the Video Edit Arena to evaluate the frontier capability of video models! - #1 Grok-Imagine-Video, @xAI - #2 Kling-o3-pro, @Kling_ai - #3 Kling-o1-pro, @Kling_ai - #4 Gen4-aleph, @Runwayml The leaderboard is powered by thousands of real-world community votes. Click the Edit button in Video Arena to edit any video and compare top model outputs. More models coming soon!

arena's tweet photo. Today we’re launching the Video Edit Arena to evaluate the frontier capability of video models!

- #1 Grok-Imagine-Video, @xAI
- #2 Kling-o3-pro, @Kling_ai
- #3 Kling-o1-pro, @Kling_ai
- #4 Gen4-aleph, @Runwayml

The leaderboard is powered by thousands of real-world community votes. Click the Edit button in Video Arena to edit any video and compare top model outputs.

More models coming soon!

17

194

18

22

26K

gamell retweeted

4 months ago

Today, we’re launching a dedicated Multi-File React leaderboard. When Code Arena first launched, we evaluated models on single-file HTML. Then we raised the bar → multi-file React apps (routing, hooks, components, state management) and now have a leaderboard to match! Single-file HTML tests instruction following. Multi-file React tests more complex agentic capabilities: • Cross-file coordination • Component architecture • Dependency management • State management • Build reliability Now see how models stack up under both scenarios. Learn more in thread 🧵👇

arena's tweet photo. Today, we’re launching a dedicated Multi-File React leaderboard. When Code Arena first launched, we evaluated models on single-file HTML.

Then we raised the bar → multi-file React apps (routing, hooks, components, state management) and now have a leaderboard to match!

Single-file HTML tests instruction following.
Multi-file React tests more complex agentic capabilities:
• Cross-file coordination
• Component architecture
• Dependency management
• State management
• Build reliability

Now see how models stack up under both scenarios. Learn more in thread 🧵👇

11

143

8

19

12K

gamell retweeted

4 months ago

⚡️Who powers the Arena leaderboard? You do. But not all votes are Arena level research-grade quality. Every score is built from real-world prompts and human input, continuously refreshed as the way we use AI evolves. In this video, ML Scientist @cthorrez explains how votes are processed and validated into high-quality data trusted by leading AI labs. Dive into: ▪️ Filtering low-signal and bot-like activity ▪️ Our quality control and validation pipeline ▪️ Ensuring all votes are based on blind pairings ▪️ How diverse user prompts differ from static benchmarks Arena scores reflect how models perform in the wild: across coding, creative work, expert tasks, and everyday use. Watch the breakdown to see how your vote helps shape the frontier.

3

47

4

3

7K

gamell retweeted

4 months ago

✨NEW: Arena Leaderboard UI Updates Millions of votes power the leaderboard. Now you can filter for what matters to you. A new side panel lets you filter and break down ranked results to find the best model for your task. Some highlights: • Filter by category (e.g. Coding, Expert prompts) • Open vs. Proprietary Models • Rank labs by their top-performing models …just to name a few. Check it out, and let us know what you think.

8

95

13

9

8K

gamell retweeted

4 months ago

The new @xAI Grok-Imagine-Image model is a Pareto-optimal model in Image Arena: The Pareto frontier tells us which model has the highest Arena score at each price point. @xAi’s latest models have improved the frontier, giving optimal performance in the mid-price tier. For a wide range of prices between 2c and 8c per image, @elonmusk’s @xAI has the leading model, delivering the maximum performance. Top models on the Pareto frontier for Image Arena (Single Image Edit): - @OpenAI: GPT-Image-1.5-high-fidelity - @xAI: Grok Imagine Image Pro - @xAI: Grok Imagine Image - @bfl_ml: Flux 2 Klein 9B - @bfl_ml: Flux-2-Dev - @reve : V1.1 Fast See thread for how the frontier changes for Text-to-Image 🧵

arena's tweet photo. The new @xAI Grok-Imagine-Image model is a Pareto-optimal model in Image Arena:

The Pareto frontier tells us which model has the highest Arena score at each price point. @xAi’s latest models have improved the frontier, giving optimal performance in the mid-price tier. For a wide range of prices between 2c and 8c per image, @elonmusk’s @xAI has the leading model, delivering the maximum performance.

Top models on the Pareto frontier for Image Arena (Single Image Edit):
- @OpenAI: GPT-Image-1.5-high-fidelity
- @xAI: Grok Imagine Image Pro
- @xAI: Grok Imagine Image
- @bfl_ml: Flux 2 Klein 9B
- @bfl_ml: Flux-2-Dev
- @reve : V1.1 Fast

See thread for how the frontier changes for Text-to-Image 🧵

131

926

150

209

10M

gamell retweeted

4 months ago

🚨BREAKING: Claude Opus 4.6 by @AnthropicAI is now #1 across Code, Text and Expert Arena! Opus 4.6 shows significant gains across the board: - #1 Code Arena: +106 score vs Opus 4.5 - #1 Text Arena: scoring 1496, +10 vs Gemini 3 Pro - #1 Expert Arena: +~50 lead Congrats to the @AnthropicAI team on the incredible milestone! The frontier just moved.

arena's tweet photo. 🚨BREAKING: Claude Opus 4.6 by @AnthropicAI is now #1 across Code, Text and Expert Arena!

Opus 4.6 shows significant gains across the board:
- #1 Code Arena: +106 score vs Opus 4.5
- #1 Text Arena: scoring 1496, +10 vs Gemini 3 Pro
- #1 Expert Arena: +~50 lead

Congrats to the @AnthropicAI team on the incredible milestone! The frontier just moved.

28

728

64

82

213K

gamell retweeted

4 months ago

👋Say hello to Max! Max is Arena’s intelligent router, powered by 5+ million real-world community votes. Max routes each prompt to the most capable model with latency in mind. AI models excel at different things (code, math, speed, reasoning). Max orchestrates across model strengths to deliver reliable performance across real-world use cases. Available today in Direct chat!

8

229

20

62

31K

Joan Gamell @gamell

5 months ago

@arena Absolute CLASS

0

4

0

0

627

gamell retweeted

5 months ago

LMArena is now Arena. A name that takes us back to our roots with a powerful mission: to measure and advance the frontier of AI for real-world use. We have grown from a small PhD research project to a platform powered by a global community of millions. This rebrand has been shaped by the people who use it. 👇 Take a look inside the rebrand.

88

1K

108

286

307K

gamell retweeted

5 months ago

🚨BIG NEWS: 🎬 Video Arena is now live on the web! Test out Veo 3.1, Sora 2, Seedance v1.5 Pro, Kling 2.6 Pro, Wan 2.5 & more. What started last summer as a small Discord bot experiment has grown into a rigorous way to measure and understand how frontier video models perform with real-world use. Thank you to our wonderful community for all the feedback! Today, we’re opening up access by making it available on the web. 🎥 Generate videos with 15 different frontier AI models and compare them head-to-head. 📊 Vote for the best output to power the leaderboards.

21

210

32

59

62K

gamell retweeted

5 months ago

Who’s actually leading the AI race? It depends on which leaderboard you look at. On Arena’s Text leaderboard (since May 2023): 🔹@OpenAI leads 74% of the time 🔹@GoogleDeepMind 21% 🔹@AnthropicAI 5% But zoom into Expert prompts (~5% of the hardest real-world tasks) and the story flips. 👇 On Arena’s Expert Text leaderboard (since March 2024): 🔸@AnthropicAI leads 48% of the time 🔸@OpenAI 37% 🔸@GoogleDeepMind 12% 🔸@Deepseek_AI 4% The takeaway: Different tasks. Different winners.

19

166

14

39

24K

Last Seen Users on Sotwe

Trends for you

Most Popular Users