More of the iOS app loop, now inside Codex.
The Build iOS Apps plugin lets Codex view and test your iOS app in the in-app browser, open SwiftUI previews, and hot reload edits without leaving Codex.
insane ball knowledge in codex
I just found out @wonforall has a skill called $kobe that spawns 3 subagents to discuss and review his code, each built to represent one of our principal engineers and tuned on that engineer's past code reviews.
I'm going to start doing this with @dkundel and @charlierguo for our docs...
On the @latentspacepod podcast, I shared my view on video generation, world models, LLMs, agents, continual learning, and where the next frontier is.
1. Video models get most of their intelligence from language, not from video data.
2. Idea-to-code is fast now. The bottleneck is back to having enough compute to try every idea.
3. Iteration speed beats almost everything else in model development.
4. The next leap won't be a better video model. It'll be a video agent.
5. Diffusion will be the frontend of AGI, the LLM the backend. Generative UI will replace HTML/CSS: user intent straight to pixels.
6. Physical embodiment may become a tool a powerful AI picks up. Robotics may get solved by video-capable LLMs.
7. Continual learning may look like models that manage their own context, and even rewrite their own harness at test time.
Thanks @swyx and @vibhuuuus for having me 🙏
https://t.co/mLuvbODJxA
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.