mark erdmann @markerdmann - Twitter Profile

about 3 hours ago

@alienpisscrack interesting. i'm also curious - given the known intelligence dropoff as context size grows, do the models perform better with the most concise/expressive languages (rust comes to mind).


mark erdmann

@markerdmann

about 4 hours ago

would be interesting to see a drill-down analyzing the impact of programming language used

Arena.ai

@arena

1 day ago

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.



mark erdmann

@markerdmann

about 3 hours ago

@garybasin @nnnnicholas


mark erdmann

@markerdmann

about 4 hours ago

@_smcf poor claude is so bored



mark erdmann

@markerdmann

about 4 hours ago

@Dimillian @ilyasut also, while we're on the phone -- thomas, any chance you could pass along this bug report to the relevant colleague? https://t.co/L8r4ob8lST

mark erdmann

@markerdmann

3 days ago

the codex desktop app has this gnarly bug where old threads aren't visible. luckily they're still saved in ~/.codex, so there's hope. i tried to look at the relevant code and contribute a PR, but apparently the codex desktop app source is not public? seems odd. time to pray to st. tibo @thsottiaux



mark erdmann

@markerdmann

about 4 hours ago

@Dimillian @ilyasut was right, the models just want to build


mark erdmann

@markerdmann

about 4 hours ago

@deepfates luckily i'm an ai expert so i was not fooled by your misinformation on social media


mark erdmann

@markerdmann

about 4 hours ago

@TheStalwart did you try it with the goals feature? yesterday i gave codex a research task and it failed on the first attempt. i tried again with goals (and a new prompt with clear validation criteria) and it did a terrific job.


mark erdmann

@markerdmann

about 4 hours ago

if you enjoy physics + startups, check this out. on today's morning walk, i asked my voice agent to teach me wealth creation through the lens of thermodynamics. after the walk, i gave the transcript to gpt-image-2 and generated this image. https://t.co/mgiekg6Mhf


mark erdmann

@markerdmann

about 4 hours ago

@Vtrivedy10 it's not perfect - i still feel that fresh intelligence buzz when starting a new codex thread - but very very good. especially when running with the goals feature.


mark erdmann

@markerdmann

about 5 hours ago

this is awesome, but also i remain convinced that everyone is still sleeping on raw gpt-image-2. it's both 1) a reasoning model that just happens to output images, and 2) will accept up to 32k character inputs. here are two outputs i generated, using this prompt: "<a long chatgpt summary of the x recommendation algo> teach me this visually, a very simple illustration with a cute blob character"



mark erdmann

@markerdmann

about 5 hours ago

@paulg the map dies first, then the machine


mark erdmann

@markerdmann

1 day ago

great new leaderboard. this lines up well with my personal experience using these models.



mark erdmann

@markerdmann

1 day ago

@levelsio aha got it. excited for it to grow!


mark erdmann

@markerdmann

1 day ago

@petergyang anthropic does not play well with others


mark erdmann

@markerdmann

2 days ago

@naterez94 @pangram ai?


markerdmann retweeted

kasra

@kasratweets

2 days ago

this is an interesting point in the new ted chiang piece – no one really claims that alphafold is conscious, or that sora or midjourney or dall-e are conscious