Agent Arena has GPT-5.5 on top, Claude Opus 4.7 close behind. 300k sessions with real tools, bash, search, files. the rankings barely look like the chatbot arena
Anthropic is calling for a global AI pause, worried models could soon improve themselves without human help. feels like we keep hitting these moments faster than expected
the recursive self-improvement loop is closing. anthropic's latest paper says claude wrote most of the code for its successor, and helped write the safety warning too
@chetaslua google shipping a weaker public model while a stronger one sits internally changes how you read every benchmark and every comparison between labs
anthropic, the company most vocal about ai safety, just filed to go public while asking everyone to slow down. the tension is real and worth thinking about.
@AIHighlight capturing a solved workflow as a reusable template changes the economics. you aren't paying for the same reasoning twice, and every solution makes the next one cheaper
the flat rate AI era lasted about 18 months. GitHub Copilot killed it june 1, switching to per-token pricing. agentic workflows burn too many tokens for subscriptions
@ZackKorman every lab that takes the lead seems to discover that safety concerns scale with market position. the timing rarely lines up with the technology
Codex now has shareable profiles and an iOS simulator built right into the app. the line between ai assistant and operating system keeps getting thinner
@tenobrus a decade of editorial judgment about what clean code looks like transfers perfectly to prompting claude. the keyboard changed but the decisions didn't
300 agents, 4000 steps, one afternoon. someone built a working saas without writing a single line of code. the architecture matters as much as the model now
@Lentils80 a model generating a functional desktop is a strange artifact. the environment we use to prompt the model is becoming something the model can build