co-founded @pulley in 2019. grew to 5k happy customers and $XXm ARR.
now on pat break with tiny new human. 👶
exploring - ai eng, voice agents, edu games
@alienpisscrack interesting.
i'm also curious - given the known intelligence dropoff as context size grows, do the models perform better with the most concise/expressive languages (rust comes to mind).
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
@Dimillian @ilyasut also, while we're on the phone -- thomas, any chance you could pass along this bug report to the relevant colleague?
https://t.co/L8r4ob8lST
the codex desktop app has this gnarly bug where old threads aren't visible. luckily they're still saved in ~/.codex, so there's hope.
i tried to look at the relevant code and contribute a PR, but apparently the codex desktop app source is not public? seems odd.
time to pray to st. tibo @thsottiaux
@TheStalwart did you try it with the goals feature?
yesterday i gave codex a research task and it failed on the first attempt. i tried again with goals (and a new prompt with clear validation criteria) and it did a terrific job.
if you enjoy physics + startups, check this out.
on today's morning walk, i asked my voice agent to teach me wealth creation through the lens of thermodynamics.
after the walk, i gave the transcript to gpt-image-2 and generated this image.
@Vtrivedy10 it's not perfect - i still feel that fresh intelligence buzz when starting a new codex thread - but very very good. especially when running with the goals feature.
this is awesome, but also i remain convinced that everyone is still sleeping on raw gpt-image-2.
it both 1) is a reasoning model that just happens to output images, and 2) accepts inputs of up to 32k characters.
here are two outputs i generated, using this prompt: "<a long chatgpt summary of the x recommendation algo> teach me this visually, a very simple illustration with a cute blob character"
this is an interesting point in the new ted chiang piece - no one really claims that alphafold is conscious, or that sora or midjourney or dall-e are conscious