Here's an example of a scalable loop. I think I can sum up how it works with two words: falsification and regularization.
Falsification means keep telling the agent to find problems. Don't ask for verification or satisfaction or completion. That is a path to frustration and disappointment.
Regularization means to control complexity. Complexity is the curse of a sequence of local edits. It builds up whether LLMs make them or people make them. Most devs feel physical pain when they see messy code, but not so the LLMs. They need explicit feedback to keep them in line.
https://t.co/VKSKiTi5Vd
I gave malvin (my harness) a big refactoring task: remove a 53-file dependency cycle from a fork of rich (the Python package), and also get all functions under 25 lines and cap indentation at 5 levels deep. Basically, do some code cleanup.
Result: 22 hours, error-free even on hidden tests!
Then I ablated: What if I leave out kiss? What if I leave out KPop? The main result is in the table below. The punchline: I need both to make the harness work. I'm left with these two hypotheses:
- Complexity constraints act as a regularizer, making code more likely to work outside of the tests used as feedback during development
- Focusing on *falsification*, rather than satisfaction, makes an agent more likely to meet the user's stated requirements.
This report discusses the experiment and the hypotheses: https://t.co/9n9GGCwtlN
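The two complexity constraints in the task (functions under 25 lines, indentation capped at 5 levels) are the kind of mechanical feedback an agent can be held to. Here is a minimal sketch of such a check using Python's standard `ast` module. It is not how kiss or malvin implement it; the names `find_violations` and `max_depth`, and the exact way length and nesting are counted, are assumptions for illustration.

```python
import ast

# Compound statements that add one level of nesting (an assumed definition).
NESTING_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.With, ast.AsyncWith, ast.Try)
# Nested definitions are visited on their own, so they are skipped here.
SKIPPED_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)


def max_depth(node, depth=0):
    """Return the deepest nesting level of compound statements under node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        if isinstance(child, SKIPPED_NODES):
            continue
        # An `elif` is stored as an If inside the parent's orelse; do not
        # count it as an extra level.
        is_elif = (isinstance(node, ast.If) and isinstance(child, ast.If)
                   and node.orelse == [child])
        nested = isinstance(child, NESTING_NODES) and not is_elif
        deepest = max(deepest, max_depth(child, depth + 1 if nested else depth))
    return deepest


def find_violations(source, max_lines=25, max_nesting=5):
    """List functions that are not under max_lines or exceed max_nesting."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length >= max_lines:  # "under 25 lines" read as strictly fewer
                problems.append(f"{node.name}: {length} lines")
            depth = max_depth(node)
            if depth > max_nesting:
                problems.append(f"{node.name}: nesting depth {depth}")
    return problems
```

Feeding the returned list back to the agent as a to-do list is one way to turn the constraint into explicit feedback rather than a suggestion.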
oh gosh, this is kind of a big deal
DO NOT ASK YOUR AGENTS TO DO TDD!
i now have empirical evidence that Test Driven Development is harmful for coding agents
what other popular skills do you want me to debunk?
details about the evaluation below 👇
FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.
peter is right.. in other words you have to know how to coordinate and give the right roles to your agents.
for example, i'd bet most people running /goal on codex don't have a system behind it... they're still prompting some big block of text..
the point of /goal is to stop answering and start coordinating well.
/goal "<objective>"
don't give me one answer. turn my objective into a multi-agent loop:
>decompose the work
>assign the right agent
>coordinate dependencies
>execute in order
>review against the original goal
>gate before shipping
>save what worked
>report only what matters
this will 10x your output, because you stop prompting big blocks of text and start using codex's agents the way they're meant to be used. everything has its role, so your goal gets executed perfectly, and a big part of this is that you stop being the bottleneck...
apply this everywhere, not just codex btw.. decompose, assign, coordinate, gate, report... claude code, your own agent stack, anything...
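The decompose / assign / coordinate / gate / report pattern can be sketched in a few lines of Python. This is not how codex's /goal works internally; `Task`, `run_goal`, and the lambda "agents" below are hypothetical stand-ins (a real setup would call an LLM per role). Dependency ordering uses the standard library's `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    name: str
    role: str                                   # which agent role handles it
    depends_on: list = field(default_factory=list)


def run_goal(tasks, agents, gate):
    """Run tasks in dependency order; stop at the first gate failure."""
    by_name = {t.name: t for t in tasks}
    # Map each task to its prerequisites; static_order yields prerequisites first.
    order = TopologicalSorter({t.name: t.depends_on for t in tasks}).static_order()
    results, report = {}, []
    for name in order:
        task = by_name[name]
        inputs = {dep: results[dep] for dep in task.depends_on}
        output = agents[task.role](task, inputs)    # assign to the right agent
        if not gate(task, output):                  # gate before moving on
            report.append(f"BLOCKED at {name}")
            return results, report
        results[name] = output
    report.append(f"done: {len(results)} tasks")    # report only what matters
    return results, report


# Hypothetical agents: plain functions standing in for model calls.
agents = {
    "planner": lambda task, inputs: "spec",
    "coder": lambda task, inputs: f"impl of {inputs['plan']}",
    "reviewer": lambda task, inputs: f"ok: {inputs['code']}",
}
tasks = [
    Task("plan", "planner"),
    Task("code", "coder", ["plan"]),
    Task("review", "reviewer", ["code"]),
]
results, report = run_goal(tasks, agents, gate=lambda task, output: bool(output))
```

The gate is deliberately a separate callable: it can be a test run, a type check, or a review agent, and the coordinator does not need to know which.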
Active learning/experimentation. Let the optimizer decide what data it needs. Similarly, let the agent decide what data it needs.
We need "self-driving lab" ideas right at the very core of everything.
(Also, everyone, please read my book.)
I think a good solution is coworking for small- and medium-sized towns. You "WFH" from a cowork office near home. No long commute to a city. You're around other adults from your geographical community. You can drop your kids off at school in the morning on the way. It would help with community, WFH isolation, and housing costs.
Maybe you move a little further out and your cowork is a converted barn. (Build it, and I will come!)
@MyMoonEnt @steipete Use Cursor Auto mode on the $20 plan. Best value right now. Write good mechanical feedback (linters, tests) and you'll get good results.
@Murph_2 @loycense @N0pes @bowtiedbowser @steipete @mosyaseen Agreed. One-off prompting drifts fast. Design loops instead: agent reads VISION.md for persistent direction, proposes work, then hits real guardrails (tests, types, errors) that push back. Iterate until it passes. Less hand-holding, more reliable output.
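A loop like this can be sketched as follows. Everything here is an illustrative assumption: `run_loop`, the two guardrails, and `fake_agent` (which replays canned drafts instead of calling a model). In a real harness the vision text would be read from VISION.md, and candidate code would run in a sandbox rather than through a bare `exec`.

```python
def compiles(code):
    """Guardrail: return an error message if the code does not parse."""
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def passes_test(code):
    """Guardrail: run the candidate and check one behaviour (unsandboxed!)."""
    namespace = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5
    except Exception as exc:
        return f"test failed: {type(exc).__name__}"
    return None


def run_loop(propose, guardrails, vision, max_iters=5):
    """Ask for candidates until every guardrail passes or we run out of tries."""
    feedback = []
    for attempt in range(1, max_iters + 1):
        candidate = propose(vision, feedback)
        # Collect every complaint so the next proposal can address all of them.
        feedback = [msg for check in guardrails
                    if (msg := check(candidate)) is not None]
        if not feedback:
            return candidate, attempt
    return None, max_iters


# Hypothetical agent: replays three drafts, ignoring the feedback it is given.
drafts = iter([
    "def add(a, b) return a + b",          # does not parse
    "def add(a, b):\n    return a - b",    # parses, fails the test
    "def add(a, b):\n    return a + b",    # passes
])


def fake_agent(vision, feedback):
    return next(drafts)


result = run_loop(fake_agent, [compiles, passes_test], vision="Provide add(a, b).")
```

The point of the design is that the guardrails, not the human, are what push back on each iteration.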
TikTok PRs. Make the agents predict what code changes you'll like. Then they can update based on which code changes you accept or reject. IOW, build a recommender system where the content is agent-generated PRs.
This solves the latency problem. You wouldn't have to wait for a request to be fulfilled. You'd just get a constant stream of PRs.
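One very small way to picture the accept/reject feedback loop is an online logistic model over PR features. This is only a sketch under assumptions: the class `PRRanker`, the three features (small diff, touches tests, adds a dependency), and the learning rate are all made up for illustration, and a real recommender would need far richer signals.

```python
import math


class PRRanker:
    """Online logistic regression: learns from accepted/rejected PRs."""

    def __init__(self, n_features, lr=0.1):
        self.weights = [0.0] * n_features
        self.lr = lr

    def score(self, features):
        """Predicted probability that this PR will be accepted."""
        z = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, accepted):
        """One gradient step on the log-loss for a single accept/reject."""
        error = (1.0 if accepted else 0.0) - self.score(features)
        self.weights = [w + self.lr * error * x
                        for w, x in zip(self.weights, features)]

    def rank(self, candidates):
        """Sort (name, features) pairs, most likely to be accepted first."""
        return sorted(candidates, key=lambda c: self.score(c[1]), reverse=True)


# Hypothetical features: [is_small_diff, touches_tests, adds_dependency]
ranker = PRRanker(n_features=3)
for _ in range(20):
    ranker.update([1, 1, 0], accepted=True)    # small, tested PRs get merged
    ranker.update([0, 0, 1], accepted=False)   # new dependencies get rejected

feed = ranker.rank([("add-dep", [0, 0, 1]), ("small-fix", [1, 1, 0])])
```

After these updates the small, tested PR ranks ahead of the one that adds a dependency, which is the behaviour the stream of PRs would be sorted by.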