🧵 Deli AutoResearch SKILL is now officially open source! 🎉
https://t.co/V3lwwdyQm8
Alongside it, we’re dropping our 4th survey paper — this time on Self-play.
https://t.co/SEb2qoKCI6
Inspired by AlphaZero, we got a powerful insight: prior knowledge doesn’t always lift the ceiling.
Models can discover more globally optimal solutions just by playing against themselves.
The biggest change in this paper?
For the first time, the AutoResearch Agent autonomously planned GPU experiments — and submitted actual RL runs on the DeepSeek 285B model.
The entire RL pipeline — experiment design, code writing, running, debugging, and conclusion summarization — was 100% automated, with zero human intervention from me.
This was incredibly difficult, but an incredibly important step.
https://t.co/kuZZNux5RH
GRPO is the tool being called by the AutoResearch Agent here.
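For anyone new to GRPO: the core idea is that each sampled completion is scored relative to the other completions for the same prompt, so no separate value model is needed. A minimal sketch of that group-relative advantage step, assuming the commonly described formulation (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the use of population standard deviation are illustrative assumptions, not taken from the AutoResearch code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population std; some implementations use the sample std instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored 1.0 (pass) or 0.0 (fail)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one, which is what the policy update then reinforces.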
We see this as the beginning of our Continual Learning research journey. 🚀
As always, this is my personal research project, unaffiliated with any organization. All views are my own.
#AI #ReinforcementLearning #SelfPlay #OpenSource #AutoML #ContinualLearning #DeepSeek
Anthropic research lead:
"99% of our engineers are running swarms of 300+ self-improving agents.
Close the agent loop. Give the model a way to verify its own output."
In a 20-minute session, an Anthropic team member explains how to build a model that improves itself.
Claude + loops + plan mode + dynamic workflows: that's the secret.
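"Close the agent loop" can be read as: generate, run a check, feed the failure back, try again. A minimal sketch of that pattern, not the setup from the talk. `closed_loop`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy functions stand in for a real model call and a real test suite.

```python
def closed_loop(task, generate, verify, max_rounds=3):
    """Generate, verify, and retry with the verifier's feedback."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        output = generate(task, feedback)
        ok, feedback = verify(output)
        if ok:
            return output, round_no
    return None, max_rounds  # gave up: nothing passed verification

# Toy stand-ins so the sketch runs without a model.
def toy_generate(task, feedback):
    # First attempt has a bug; any feedback triggers the corrected version.
    return "def add(a, b): return a + b" if feedback else "def add(a, b): return a - b"

def toy_verify(code):
    scope = {}
    exec(code, scope)  # acceptable for a toy; sandbox anything a real model wrote
    ok = scope["add"](2, 3) == 5
    return ok, "" if ok else "add(2, 3) should be 5"

result, rounds = closed_loop("write add", toy_generate, toy_verify)
```

The verifier is the part that matters: without an external check, the loop has no signal to improve on.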
Watch the talk, then save the playbook below.
this is how to run claude fable 5 as your architect ($20 sub only) + gpt 5.5 codex as your builder.
full system below:
the loop is: fable thinks, codex builds, the repo remembers, and you judge. that simple.
the point of all this: we take advantage of 5.5 being on a sub and fast enough, especially with /goal, and we use the latest Anthropic model as the judge and guide.
step 1
>create the memory (one time): make docs/HANDOFF.md in your repo.
>codex updates it after every work session: what was built, what was decided + why, open disagreements, next slice. this file is why 30 min of fable is enough: it reads state instead of asking you questions.
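if you want the handoff entries to stay uniform, here is a rough sketch of a helper the builder could call at the end of a session. the section headings and the function name `append_handoff_entry` are my assumptions; only the four fields (built, decided + why, open disagreements, next slice) come from the step above.

```python
from datetime import date
from pathlib import Path

def append_handoff_entry(path, built, decisions, disagreements, next_slice, day=None):
    """Append one session entry to the handoff file (hypothetical layout)."""
    day = day or date.today().isoformat()
    lines = [f"## Session {day}", "", "### Built"]
    lines += [f"- {item}" for item in built]
    lines += ["", "### Decided (+ why)"]
    lines += [f"- {what}: {why}" for what, why in decisions]
    lines += ["", "### Open disagreements"]
    lines += [f"- {d}" for d in disagreements] or ["- none"]
    lines += ["", "### Next slice", next_slice, ""]
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Append only: earlier sessions are never rewritten.
    with target.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```

appending instead of overwriting keeps the full decision history, so the architect can see why something was decided three sessions ago.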
step 2: paste this to fable (every session)
>you are the ARCHITECT for [project]
>gpt 5.5 codex is the BUILDER
>you never write implementation code.
>your jobs:
(1) read the handoff below
(2) rule on every disagreement the builder raised: accept/reject/modify + one line why
(3) judge any results RAW against the gates in the docs and ignore the builder's narrative
(4) write the next slice spec: small enough for one PR, hard acceptance criteria, explicit out-of-scope, and force the builder to verify APIs/formats against reality before coding
(5) flag scope creep and goalpost-moving.
>be blunt. disagree with me. end with a paste-ready block for the builder.
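job (3), judging raw results against the gates, is mechanical enough to sketch. this assumes gates are written down as metric, comparison, threshold; the gate format, the metric names, and `judge` are all hypothetical, not something the prompt above defines.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq}

def judge(gates, raw):
    """Return pass/fail per gate from raw numbers only; a missing metric fails."""
    verdicts = {}
    for metric, (op, threshold) in gates.items():
        value = raw.get(metric)
        verdicts[metric] = value is not None and OPS[op](value, threshold)
    return verdicts

# Hypothetical gates, frozen in the docs before any results exist
gates = {"accuracy": (">=", 0.90), "p95_latency_ms": ("<=", 200)}
# Raw numbers copied from the handoff, with no builder commentary attached
raw = {"accuracy": 0.93, "p95_latency_ms": 240}
verdicts = judge(gates, raw)
```

the builder's narrative never enters the function: only numbers and thresholds do.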
step 3: paste fable's block to codex with this /goal
/goal: execute the architect spec. rules:
PHASE 0: before any code, reply with your plan + every disagreement you have, with reasons, citing real files in the repo. silent compliance = failure. silent scope additions = failure.
PHASE 1: freeze shared contracts (schemas/interfaces) in docs/ first; after freeze they're read-only for everyone including you.
PHASE 2: spawn max 3-4 lane agents on modules that don't import each other, plus ONE reviewer agent that never writes feature code: it checks every lane against the spec + tests + frozen docs and returns APPROVE or a numbered defect list. nothing merges without approve.
then: commit + push each slice, update docs/HANDOFF.md with raw results only (tables and numbers, no interpretation, no 'promising'). verdicts belong to the architect and the human.
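the "nothing merges without approve" rule is easy to enforce in a script. a rough sketch, assuming the reviewer replies with exactly `APPROVE` or a numbered list like `1. ...` / `2) ...`; the reply format details and both function names are my assumptions.

```python
import re

def parse_review(reply):
    """Return (approved, defects) from a reviewer reply."""
    text = reply.strip()
    if text == "APPROVE":
        return True, []
    # Collect numbered items such as "1. missing test" or "2) schema drift"
    defects = re.findall(r"^\s*\d+[.)]\s+(.+)$", text, flags=re.MULTILINE)
    return False, defects

def can_merge(lane_reviews):
    """Merge only when every lane was explicitly approved."""
    return all(parse_review(reply)[0] for reply in lane_reviews.values())

reviews = {
    "lane-a": "APPROVE",
    "lane-b": "1. missing test for empty input\n2) edits a frozen schema",
}
merge_ok = can_merge(reviews)  # lane-b blocks the merge
```

anything that is not a literal APPROVE counts as a rejection, so a vague "looks fine" from the reviewer does not slip through.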
step 4: repeat. codex works hours; you spend fable minutes on judgment only: arbitration, evidence review, next specs, kill/continue calls. one fable session per work block.
the 5 rules that make it actually work
>repo docs are the memory. not in HANDOFF.md = didn't happen
>the builder never grades its own work
>disagreement is mandatory
>freeze success criteria BEFORE results exist, never edit after
>spend architect time on judgment, builder time on typing
>the architect is the edge and the builder is the hands. the repo is the brain.. think of it that way..
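one way to make "freeze, never edit after" checkable instead of a promise: record a hash of each frozen doc at freeze time and compare later. a minimal sketch; the function names are hypothetical, and in practice the digests would need to live somewhere the builder can't quietly rewrite them.

```python
import hashlib
from pathlib import Path

def digest(path):
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def freeze(paths):
    """Record a digest for every frozen doc (contracts, success criteria)."""
    return {str(p): digest(p) for p in paths}

def changed_since_freeze(frozen):
    """Return the frozen files whose content no longer matches the record."""
    return [p for p, d in frozen.items() if digest(p) != d]
```

run `changed_since_freeze` before the architect reviews results: a non-empty list means the goalposts moved.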
bookmark this. you will need it. you really won't need to pay hundreds in API tokens if you do it this way.