Wokes don't like me because I'm pragmatic towards AI and I prefer my video game characters looking hot. Chuds don't like me because furry with pronouns in bio. This online culture war stuff is tough.
The current goal is implementation of the sprites/animations of five buildings and two items. Sphere Harvester (produces Sphere), Soul Condenser (converts Sphere to Sentience Core) and Subterranean Storage (just a way to get items off the map for now), plus Inserter/Belts.
So for the buildings, since idk how you do perspective with a resolution this low, if you even can, I'm doing some form of oblique projection(?) combined with a very limited color palette. The SimCity 1989 way. Just with goo on top.
I can't express it visually due lack of skill, but conceptually I want the conveyor belts to be organic alien tentacle things, ever undulating forward.
Liberal
Liberal, liberal, liberal, liberal
Liberal, liberal, liberal, liberal
Happy fun, la-la-la
Human smuggling, fentanyl deaths
Forced government euthanasia
La ch-ch... chopping up children's genitals
People want me to pretend that this isn't literal Star Trek technology. Telling the computer a task in natural language, and it completes it. Like, come on.
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praise or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.