@mitchellh I do something similar with context windows. Don’t use the 1M models, stick to the 256k variants. If it hits a compaction, decompose and try again.
I think this is better because a small LOC change could still require a lot of background work / verification / thinking.
My heuristic is that any diff an agent generates over ~1500 lines is too big and is indicative that the problem needs to be decomposed. This is my general pattern now for feature work:
1. Try to implement the whole feature, loosely guided. I call this the "draw the owl" prompt in reference to the meme. Expect garbage, you're going to get garbage.
2. If the diff is less than 1500 lines, review it and iterate normally. If the diff is more than 1500 lines, prompt the agent to decompose the problem into atomic, incremental, reviewable tasks. Simultaneously, do this yourself.
3. Agents will very often make these tasks way too specific to the shape they solved. You need to massage it into the right general shape. Do that.
4. Kick off new agents to work on those incremental things (as parallelized as possible). Apply the same rules.
5. At a certain point, repeat the "draw the owl" prompt. At some point, you will get beneath your review-ability threshold.
This has been producing consistently high quality, maintainable, reviewable chunks of code that have a good handoff to either merge as-is or human refinement.
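A rough sketch of the size check in step 2, in Python. Assumptions on my part: "diff size" means added plus removed lines in a unified diff, and the names `changed_lines` and `triage` are made up for illustration. The ~1500 cutoff is the heuristic from above, not a hard rule.

```python
REVIEW_THRESHOLD = 1500  # the ~1500-line heuristic; tune to your own review tolerance


def changed_lines(diff_text: str) -> int:
    """Count added and removed lines in a unified diff.

    Assumption: file header lines start with '+++' or '---' and are skipped.
    A removed line whose content itself starts with '--' would be miscounted;
    that is acceptable for a rough size check.
    """
    count = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def triage(diff_text: str, threshold: int = REVIEW_THRESHOLD) -> str:
    """Return 'review' for diffs under the threshold, else 'decompose'."""
    return "review" if changed_lines(diff_text) < threshold else "decompose"
```

In practice you would feed this the text of `git diff` for the agent's branch. "decompose" here just means "go to the decomposition step", not anything automated.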
And with the latest frontier models at xhigh thinking, these are all slow enough that you can usually have multiple going concurrently while you are actively reviewing others or working on your own tasks.
HITL (human-in-the-loop) agents are still super important, especially for feature work. Features touch the human boundary in terms of UI, API, etc. And net new stuff can introduce pathologies in the architecture that violate desired invariants (these should be represented in specs or tests but we aren't perfect!).
I know a lot of the leading edge agentic discourse is about "loops" and agents driving agents continuously. I do some of that (will report on that later). But, in terms of raw daily get-shit-done type of work, this is my most rewarding pattern at the moment.
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
estimating based on my usage rates today, you can run gpt-5.5 in codex continuously for ~40 hours per week on the $100 plan (once the 2x bonus usage ends). that's pretty wild.
I see a lot of people claiming agent workflow wrappers are useless, that modern claude / gpt can handle long running tasks on their own across compactions. Today I tried again with 5.5 and had to throw out the branch.
The task I gave it isn't even that uncommon - remove an abstraction layer in a 3-tier system. It requires some changes at the DB layer, service / API layer, and frontend. Should be under a 5k LOC change, probably closer to 3k, mostly removals.
I went into planning mode with 5.5 and defined what I wanted precisely. When it went into implementation, it immediately reduced the scope by putting in compat shims on the frontend and DB (in the same context window as the planning!).
I suspect this tendency is rooted in the RL method. These models are so heavily reinforced to solve the problem in a single context window that they start making bad judgements as they approach the end of the context just to get to some state they can classify as a win.
This is where workflow wrappers come in handy - only delegate tasks to agents that can actually be accomplished in a single context window. In my homegrown implementation I actually completely throw out the work if it hits compaction and scope it down further.
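For illustration, here is a minimal sketch of that "throw it out on compaction, scope down, retry" rule. This is not the actual homegrown implementation. `run_agent` and `decompose` are hypothetical callables you would supply yourself (one runs an agent on a task and reports whether it hit compaction, the other splits a task into smaller ones), and `max_depth` is an assumed guard against endless splitting.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentResult:
    task: str
    diff: str
    hit_compaction: bool  # True if the run did not fit in one context window


def run_scoped(
    task: str,
    run_agent: Callable[[str], AgentResult],  # hypothetical: runs one agent on one task
    decompose: Callable[[str], List[str]],    # hypothetical: splits a task into subtasks
    max_depth: int = 3,                       # assumed limit on how often we re-split
) -> List[AgentResult]:
    """Keep only work that finished inside a single context window."""
    result = run_agent(task)
    if not result.hit_compaction:
        return [result]
    if max_depth == 0:
        raise RuntimeError(f"could not scope down far enough: {task!r}")
    # The run hit compaction: discard its output entirely and retry on smaller tasks.
    accepted: List[AgentResult] = []
    for subtask in decompose(task):
        accepted.extend(run_scoped(subtask, run_agent, decompose, max_depth - 1))
    return accepted
```

The design choice worth noting is that nothing from a compacted run is kept, not even a partial diff. The subtasks start from a clean slate.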