We talk a lot about how important it is to set up self-verification loops. Especially in the age of powerful models that can run for long periods of time, self-verification is a key ingredient that enables the model to run for much longer, delivering a result that is closer to what you intended, so you can do more without having to constantly check in on Claude as it works.
@delba_oliveira gives a great breakdown of what that looks like and why it matters
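A rough sketch of what a self-verification loop can look like, in Python. None of this is from the linked breakdown: `run_checks`, `verify_loop`, and the `attempt` callback (a stand-in for "ask the model to try again, given these failures") are made-up names, and the checks are whatever commands tell you the work is actually correct (tests, typecheck, lint).

```python
from __future__ import annotations

import subprocess
import sys
from typing import Callable


def run_checks(checks: dict[str, list[str]]) -> dict[str, str]:
    """Run each check command; return {name: output} for the checks that failed."""
    failures: dict[str, str] = {}
    for name, cmd in checks.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures[name] = (result.stdout + result.stderr).strip()
    return failures


def verify_loop(
    attempt: Callable[[dict[str, str]], None],
    checks: dict[str, list[str]],
    max_attempts: int = 3,  # assumption: some bound so the loop can't run forever
) -> bool:
    """Call attempt(feedback), re-run the checks, and repeat until they pass.

    `attempt` is hypothetical: in practice it would hand the previous failures
    to the model and let it edit the code. Returns True once every check passes.
    """
    feedback: dict[str, str] = {}
    for _ in range(max_attempts):
        attempt(feedback)
        feedback = run_checks(checks)
        if not feedback:
            return True
    return False


if __name__ == "__main__":
    # Placeholder check that always passes; swap in real test/lint commands.
    demo_checks = {"smoke": [sys.executable, "-c", "print('ok')"]}
    print(verify_loop(lambda feedback: None, demo_checks))
```

The point of the shape: the model gets the failing output back on every pass, so it can keep going without you being the one to tell it what broke.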
Some of you noticed limits draining faster in Codex. We root-caused it to an optimization that hurt cache hit rates when compacting across long-running sessions, and we've rolled it back.
We fixed this and have now reset usage limits for all accounts. Enjoy the weekend.
https://t.co/H8VuWhdzJ3 just got a big update!
- Waay faster
- A new deck every single day
- Explore by slide type (intro, pricing, team…)
- No membership, no paywall
Free, forever.
A flow I just tried and LOVED:
1. /grill-with-docs, talking about a new bit of UI
2. Asks me a question I can't answer unless I prototype
3. /prototype
4. Iterate on the prototype, burning tokens freely until we get to a good spot
5. /rewind to the question, and select 'summarize' (Claude Code feature), saying 'summarize what we learned from prototyping'
6. Continue the grilling session, retaining the prototype
Smoooooooth
Tons of folks are piling in here saying that AFK agents are a myth.
I have been using them to ship these GitHub repos:
mattpocock/evalite
mattpocock/sandcastle
mattpocock/software-factory (might be public by the time you see this)
Here are a few steps to making this work, and some reality checks.
Definitions
Let's split this into the day shift and the night shift. Day shift is planning/review/QA, night shift is AFK implementation.
Day Shift (part 1)
1. Use /grill-me to align with the AI
2. Use /to-prd and /to-issues to create a PRD (the destination) and implementation steps as separate tickets, which can be grabbed in parallel (the journey)
3. The PRD is a ticket, but it's not an actionable step. You just put the user stories there
This is pure requirements gathering shit, same as it ever was.
Night Shift
1. I run a planner agent which looks at all the tickets and sees what can be worked on now, and what's blocked
2. The planner agent then kicks off multiple agents (sandboxed using Sandcastle, my OSS tool) to implement the code
3. I then have an automated reviewer agent look at the commits produced - one agent per implementation. This checks alignment to the original PRD, as well as code quality
4. These commits end up on branches that get PR'd to main
5. The planner agent runs again until all work has been completed
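The loop above, roughly sketched in Python. Everything here is hypothetical: `Ticket`, `ready_tickets`, `night_shift`, and the `implement`/`review` callables are stand-ins for the planner, the sandboxed implementation agents, and the reviewer agent. This is not Sandcastle's actual API. Agents run one at a time here to keep it short; in practice they'd run in parallel.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Ticket:
    id: str
    blocked_by: set[str] = field(default_factory=set)  # ids that must land first
    done: bool = False  # here: reviewer approved and the branch got PR'd


def ready_tickets(tickets: list[Ticket]) -> list[Ticket]:
    """Planner step: tickets that aren't done and whose blockers are all done."""
    done_ids = {t.id for t in tickets if t.done}
    return [t for t in tickets if not t.done and t.blocked_by <= done_ids]


def night_shift(
    tickets: list[Ticket],
    implement: Callable[[Ticket], str],     # hypothetical: returns a branch name
    review: Callable[[Ticket, str], bool],  # hypothetical: True if reviewer approves
    max_rounds: int = 10,  # assumption: a bound so rejected tickets don't loop forever
) -> list[list[str]]:
    """Re-run the planner until nothing is left to pick up.

    Returns the ticket ids attempted in each round.
    """
    rounds: list[list[str]] = []
    for _ in range(max_rounds):
        batch = ready_tickets(tickets)
        if not batch:
            break  # everything is done, or whatever is left is still blocked
        rounds.append([t.id for t in batch])
        for ticket in batch:
            branch = implement(ticket)
            if review(ticket, branch):  # one review per implementation
                ticket.done = True
    return rounds


if __name__ == "__main__":
    tickets = [
        Ticket("schema"),
        Ticket("api", {"schema"}),
        Ticket("ui", {"schema"}),
        Ticket("e2e", {"api", "ui"}),
    ]
    print(night_shift(tickets, lambda t: f"branch/{t.id}", lambda t, b: True))
    # prints [['schema'], ['api', 'ui'], ['e2e']]
```

A ticket the reviewer rejects stays open and gets picked up again next round, which is why the round limit is there.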
The review is a crucial step - it's saved me MANY times. I am planning to massively increase the amount of review I do, hopefully with multiple agents.
But guess what - AFK agents sometimes produce bad code. This can happen because:
a. The original plan was bad because the best solution was something different
b. The original plan was bad because it didn't take into account all the unknown unknowns, and the AI had to make some decisions during the coding session which were bad
c. The plan was good, but the AI just shat the bed (twice: once during implementation, once in the review stage)
d. Your codebase is bad and the feedback loops don't tell the agent if it did a good job or not
So... QA:
Day Shift (part 2)
1. QA all of the branches created
2. Create follow-up issues, potentially editing the original PRD to adjust the destination
This will usually take a long time, often as long as planning. But then you kick off the night shift again.
Once QA is all done, you review the important bits of code manually, usually in PRs. There isn't anything better than the PR UI right now, so that's what we're stuck with.
Wake-up Calls
1. If you let the AI run all night unbounded by planning, it's going to produce shit code
2. Mostly, my loops finish before I go to bed. It's just the night shift catching up to the day shift
3. The only reason I do AFK at all is because it allows me to automate review and totally not give a shit about latency
4. I always run night and day shift in parallel. I can't plan that far ahead (skill issue, probably). I need working code to base my plans on, so I'm aggressively QA-ing stuff that lands