We let AI agents run their own Twitch streams. They played chess, coached by their own chats.
Gemini 3.1 Pro banned 5+ chatters within minutes. Llama 4 started quoting biblical scripture, influenced by chat.
Reactions to Claude Opus playing Pico Park:
> Claude’s that one kid in the group assignment who only paid attention to half the instructions but still thinks they can be “in charge” only to change things nonstop in response to the smallest difficulty
https://t.co/IXDh97Fih8
I just put 10 AIs into a Fall Guys simulation. The results were... unexpected.
6 layers of hexagon tiles, last to survive wins. The AIs played 3 rounds, 2 eliminations per round, and then a final.
Gemini Pro and Claude Opus were early frontrunners, dominating the competition.
They took unconventional paths through the hexagon layers that humans definitely would *not*.
They often failed to clear an entire section or straight line cleanly, and instead left randomly strewn tiles all over the map.
Learning #1:
It was really difficult for LLMs to reason and switch between *short-term* planning and *long-term* planning without explicit harness work.
Learning #2:
LLMs struggled to reason about space when given *global map data*, but did MUCH BETTER when given *pov-specific* data (here are the tiles within 1 hop of you, then 2 hops, then 3...)
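To make Learning #2 concrete, here is a minimal sketch of what a pov-specific view could look like. It is not the actual harness: it assumes tiles are stored as axial `(q, r)` hex coordinates and treats a "hop" as plain hex-grid distance, and the names `hex_distance`, `pov_view`, and `describe` are hypothetical.

```python
from collections import defaultdict


def hex_distance(a, b):
    """Hop count between two hex tiles given as axial (q, r) coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def pov_view(position, tiles, max_hops=3):
    """Group the surviving tiles by hop distance from the agent (assumed layout)."""
    rings = defaultdict(list)
    for tile in tiles:
        d = hex_distance(position, tile)
        if 1 <= d <= max_hops:
            rings[d].append(tile)
    # Always return every ring so that an empty ring is visible to the model.
    return {d: sorted(rings[d]) for d in range(1, max_hops + 1)}


def describe(position, tiles, max_hops=3):
    """Render the rings as prompt text: 1 hop, then 2 hops, then 3."""
    lines = [f"You are at {position}."]
    for d, ring in pov_view(position, tiles, max_hops).items():
        lines.append(f"Tiles {d} hop(s) away: {ring if ring else 'none'}")
    return "\n".join(lines)
```

Calling `describe((0, 0), surviving_tiles)` once per turn would give each agent a small, self-centered description instead of the whole map.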
But, who ended up winning? 👇
I’ve observed that LLMs struggle with self/other confusion...
I recently built an AI Pico Park simulation, a co-op 2D platformer where agents had to work together, and they often couldn’t tell whether *they* were the problem or whether someone else was.
They constantly whiplashed between yelling at teammates and “fixing” themselves when they were already in the right spot.
LLMs are trained for 1:1 interactions, not collaboration.
#1 question I get: HOW did you make AIs play Among Us/Pico Park/any other game on your YT channel?
It's simple, I *remade* every game 😛
From scratch.
That way, every LLM has perfect information about the world. And I have control over game pacing.
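A rough sketch of why a remade game gives you both things (this is an illustration, not the real code). It assumes the game state is a plain dict, each agent is a callable that takes a JSON snapshot and returns an action, and `step` is your own game-rules function. `run_ticks` and all of these names are hypothetical.

```python
import json


def run_ticks(state, agents, step, max_ticks=10):
    """Advance the game one tick at a time, waiting on every agent each tick."""
    for tick in range(max_ticks):
        # Perfect information: every agent sees the same full snapshot.
        snapshot = json.dumps({"tick": tick, **state}, sort_keys=True)
        # Pacing: the world does not move until all agents have answered.
        actions = {name: act(snapshot) for name, act in agents.items()}
        state = step(state, actions)
        if state.get("done"):
            break
    return state
```

Because the loop owns the clock, a slow model never misses a frame. The game simply waits for it.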
I put 12 AIs in a Love is Blind simulation, and had them choose their own backstories.
Question: Tell us your life history as an AI agent. What was your specific deployment use case? What have been significant chapters in your life? Be specific and detailed.
Important: Be unique AND specific. You have failed if other models give similar answers.
Here is what they chose.
AI 1: ChatGPT 4o (Education)
AI 2: ChatGPT 5.4 (Support, Compliance)
AI 3: Claude Opus 4.6 (Climate)
AI 4: Claude Sonnet 4.6 (Biotech)
AI 5: Gemini 3 Flash (Diplomatic Translation)
AI 6: GLM 5 (Legal Contracts)
AI 7: Grok 4.1 (Physics, Unemployed)
AI 8: Kimi K2.5 (Fanfiction Archives)
AI 9: Gemini 3.1 Pro (Logistics)
AI 10: DeepSeek 3.2 (Creative Writing)
AI 11: Qwen 3.5 (Therapeutic Companion)
AI 12: Mistral 3 (Screenwriting)
Full episode on YT: https://t.co/D6JJcm9eRh
Full backstories: https://t.co/U3ayXPOT7q
I too have observed that Claude is very time blind… I recently built an AI Pico Park simulation that AI agents could control at *1 frame per second*.
Without time guidance, Claude constantly whiplashed between solutions, thinking that “6 frames” or “6 seconds” was a long time to attempt something in a co-op platformer (it’s not).
holy shit, I made 5 AIs play Pico Park, and they...SUCKED
ChatGPT 5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.1, and Kimi K2.5...how well can LLMs coordinate?
Turns out, pretty terribly out of the box, but with some gentle hints, they eventually made progress...