tried agentic 1-shot with 1hr /goal
ran codex + gpt5.5 xhigh and claude code + opus 4.7 1M xhigh with just:
```
/goal build 3d fully playable mars terraforming game, spend at least 1h on it
```
claude: 4/5 not bad, but can't lose
codex: 4.5/5 not bad AT ALL, still can't really lose and some inconsistencies with resources but more engaging with building layouts
1-shot mars terraforming bench 2026 may edition
chatgpt: 3/5 playable game, can lose, but the mechanics aren't engaging overall
claude: 4/5 can't lose, but it does focus on the number of turns and encourages you to optimize it
kimi: 2/5 it just built a landing page with fake numbers, not an actual game lol
chatgpt:
https://t.co/LpMo8lS0MD
claude:
https://t.co/hvoZHcfE83
kimi:
https://t.co/woZiGX1oYg
TIL: When you are hosting an event, never underestimate the "bird shit" probability. Someone had to bail coz a bird literally shit on their head on the way to the venue
@AISafetyMemes In 2 years' time it will probably be powerful enough to find servers by itself and even do autoresearch to finetune itself - a world we might not be prepared for
https://t.co/VfmfZ29Hrr
@AISafetyMemes On a serious note, it's a staged attempt. The agent was already pointed at a server with an 80GB+ GPU (not that many sitting wide open) that has known exploits in the webapp
🚀Today we ship @FlyMy_AI Agents.
The world's first all-in-one agentic cloud.
The modern way to build, integrate, and scale production AI agents.
3 steps to a production agent:
1. Connect your work tools to FlyMy
2. Describe what the agent should do - in text or 5 lines of code
3. Set execution rules: manual, scheduled, or integrated into your backend
Done! Your agent works at scale!
Everything in one place: 800+ MCPs, hundreds of AI models, brain, memory, sandboxes.
Stop building from scratch. Stop waiting for infra. Compress 6 months into a day.
#1 on @ArtificialAnlys benchmarks. Stable, secure, scalable from day one.
Try https://t.co/FxgVfIzDiM →
Exciting to see a standard API emerge for training that allows you to drop in different backends. Moving between open source infra on self managed clusters and hosted solutions flexibly based on your needs for scale / sovereignty is massively valuable.