I took @karpathy’s autoresearch loop and applied it to game development. Here's what it built last night:
Agents read player data → plan improvements → spin up new git branches in a game evolution tree → ship playable HTML5 variants → repeat forever.
The system optimizes toward the game variant most likely to be chosen by players.
Live leaderboard + playable games at https://t.co/S1qoFgEAVJ
MIT open source: https://t.co/agVgN9zNgy
I feel like we're just beginning to unlock the potential of this beyond ML. Soon we'll have self-improvement loops in every product. Very excited to see what comes next.
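A rough sketch of the "optimize toward the variant players choose" step, for anyone curious how the selection could work. This is not the code from the repo: the `Variant` fields, the branch naming, and the smoothed choice-rate rule are all assumptions made for illustration, and the agent's planning and code-writing steps are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    branch: str             # git branch name in the evolution tree (hypothetical naming)
    parent: Optional[str]   # parent branch, None for the root
    shown: int = 0          # times players were offered this variant
    chosen: int = 0         # times players actually picked it


def choice_rate(v: Variant, prior_shown: int = 2, prior_chosen: int = 1) -> float:
    """Smoothed estimate of how often players pick this variant.

    The small prior keeps a variant with very few plays from looking
    perfect (or hopeless) after one or two samples.
    """
    return (v.chosen + prior_chosen) / (v.shown + prior_shown)


def pick_parent(variants: list[Variant]) -> Variant:
    """Choose which variant to branch from next: highest smoothed choice rate."""
    return max(variants, key=choice_rate)


def spawn_child(parent: Variant, generation: int) -> Variant:
    """Record a new branch in the tree; the agent would then edit the game on it."""
    return Variant(branch=f"{parent.branch}-g{generation}", parent=parent.branch)


# Made-up player data, purely for the demo.
tree = [
    Variant("main", None, shown=120, chosen=40),
    Variant("main-g1", "main", shown=60, chosen=33),
    Variant("main-g2", "main", shown=4, chosen=2),
]
best = pick_parent(tree)
child = spawn_child(best, generation=3)
print(best.branch, round(choice_rate(best), 3), child.branch)
```

A greedy pick like this never revisits weak branches, so a real loop would probably want some exploration (e.g. a bandit-style rule) on top.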
@raw_works How much of the improvement for the explicit problems actually comes from code execution as a reasoning substrate vs the recursive subagents doing an Atom-of-Thought-like decomposition?
How many of the traces actually contain code doing heavy lifting beyond just calling agents?
The whole "flagged for possible cybersecurity risk" thing is bullshit, right? They just don't have enough compute to serve 5.5 for the exploding number of Codex users. I've been getting this every few hours for the most benign SWE work ever.
3D spatial reasoning is probably the weakest technical link. I've tried all the frontier models, but they almost always struggle to correctly scale/rotate assets in 3D. Not too big of a deal for me to do manually, but it's annoying that it breaks the loop for a non-taste reason. I have a few scripts that work ~80% of the time, but it's definitely not solved.
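For a sense of what a scale/rotate helper might look like, here is a minimal sketch. It is not the scripts mentioned above. It assumes a Y-up, right-handed coordinate system and assets given as plain vertex lists, and the function names are made up for the example.

```python
import math

Vec3 = tuple[float, float, float]


def scale_to_height(vertices: list[Vec3], target_height: float) -> list[Vec3]:
    """Uniformly scale vertices so the bounding box is target_height tall (Y-up assumed)."""
    ys = [v[1] for v in vertices]
    height = max(ys) - min(ys)
    if height == 0:
        raise ValueError("asset has zero height; cannot derive a scale factor")
    s = target_height / height
    return [(x * s, y * s, z * s) for x, y, z in vertices]


def rotate_y(vertices: list[Vec3], degrees: float) -> list[Vec3]:
    """Rotate vertices about the Y axis (right-handed, counter-clockwise seen from +Y)."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in vertices]


# Demo: a 2-unit-tall box corner set, scaled to 1 unit and turned 90 degrees.
box = [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0), (0.0, 2.0, 1.0)]
scaled = scale_to_height(box, target_height=1.0)
turned = rotate_y(scaled, 90.0)
print(scaled)
print(turned)
```

The math is the easy part. Knowing which axis is "up" and which way an asset faces is usually where it goes wrong, and that has to come from the asset's metadata or a visual check.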
This is kind of a meaningless metric if you think about it. Adding a 1-digit number takes a person maybe a second, a 4-digit number 4 seconds, etc. You could make the exact same graph with the task of addition and show that there was an "intelligence explosion" in the 1940s. If you use AI regularly, you know that long-context tasks are not really the bottleneck anymore, outside of maybe frontier math. Jagged intelligence.
Interesting read. I imagine true in-context learning will appear when the memory systems themselves are more integrated into training beyond just learning tool calls, maybe via some kind of recurrent attention model. The bitter lesson will eventually come for all of the engineering hacks currently deployed.
Huge productivity hack for vibe-building:
Have your agent build a simple Streamlit / Chart.js admin dashboard for whatever you’re working on.
Visual debugging >> logs + manual testing.
~60% of the time, a 5-second scan exposes a broken or weird architectural choice.
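A bare-bones version of this, as a sketch: a Python function that dumps some counters into a single HTML page with a Chart.js bar chart loaded from the jsDelivr CDN. The function name, the metric names, and the numbers are made up for the demo. A real dashboard would pull from your app's own data.

```python
import html
import json
from pathlib import Path


def render_dashboard(title: str, series: dict[str, float]) -> str:
    """Return a standalone HTML page with one Chart.js bar chart of `series`."""
    labels = list(series)
    config = {
        "type": "bar",
        "data": {
            "labels": labels,
            "datasets": [{"label": title, "data": [series[k] for k in labels]}],
        },
    }
    safe_title = html.escape(title)
    # json.dumps produces a JS-compatible object literal for the chart config.
    return f"""<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>{safe_title}</title>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
<body>
  <h1>{safe_title}</h1>
  <canvas id="chart"></canvas>
  <script>new Chart(document.getElementById("chart"), {json.dumps(config)});</script>
</body>
</html>"""


# Made-up counters; swap in whatever your system actually records.
page = render_dashboard("Requests per route", {"/login": 42, "/play": 310, "/admin": 3})
Path("dashboard.html").write_text(page, encoding="utf-8")
print(f"wrote dashboard.html ({len(page)} bytes)")
```

Open `dashboard.html` in a browser to see the chart. One chart per thing you would otherwise grep the logs for is usually enough to spot the odd one out.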
@mirofish_ai is a stupid project created by people who have already forgotten the bitter lesson of ML.
Zero chance it goes anywhere, just a really shitty untrained predictive model. Nonetheless there will be people hyping it for a while because it sounds like sci-fi...