with fable I'm letting claude code again
(I used to be maniacal about never letting claude EVER code, only write plans that go through codex review, and even after a pass, have codex implement)
we'll see.. I'm letting claude / fable write code directly now (but with codex reviewing)
see how it goes!
like it or not, I think claude's model is still the best for actually interacting with - it can infer what you're saying, suggest designs that match your intent, etc.
but coming with that is anthropic's own built-in tendencies - risk-reducing, safety guards, etc.
so you kind of have to "work with it" - like what I suspect is claude's tendency to finish only 80-90% of the way (which claude code's harness itself had to "re-direct" towards completion)
so I had to put in some language in claude.md along the lines of "finishing 80-90%, then pausing is a RISK" because incomplete implementations create problems later...
(contrast that with GPT - which has a lot of prompts in codex to tell it to finish... that's why it's so satisfying to work with BUT it's so "autistic" in its direction following, very dev-focused, that I still need to use claude as a layer to "translate" my intent to GPT.)
@naw103 Hah right there’s the cleanup “loop” where I have the agent check todo is completed (or the fixer is reminded to close out, move to completed todo), otherwise it’s check / repro confirm it’s done looking at commits or code, then close it out - the housekeeping “loop”
so I'm just realizing it now... but all my AI-coding workflow has effectively converged into some form of "loop"
when I have an idea or something to fix - I start with claude to investigate, then write up a plan doc
then send to codex for review, and iterate until it gets a pass > loop between claude / codex back and forth
converging findings = design is robust
expanding findings = something is wrong
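a rough sketch of that converging / expanding check, assuming you count the reviewer's findings each round. `review_trend` and its labels are made-up names for illustration, not anything claude or codex actually exposes:

```python
def review_trend(findings_per_round):
    """Classify a plan-review loop from the number of findings in each round."""
    if len(findings_per_round) < 2:
        return "not enough rounds"
    first, last = findings_per_round[0], findings_per_round[-1]
    if last == 0:
        return "pass"          # reviewer has nothing left to flag
    if last < first:
        return "converging"    # fewer findings over time: design is firming up
    if last > first:
        return "expanding"     # more findings over time: something is wrong
    return "flat"


print(review_trend([7, 3, 1]))   # converging
print(review_trend([2, 4, 6]))   # expanding
```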
then once something is ready, I immediately ask claude / codex to test it live, as close to how I'd test
which means I also built out custom dev CLI tools for the agent to use, things like an excel add-in (so it can "see" into the sheet) - or have it use chrome so it can visually verify too and drive via browser if necessary
so that's loop #2 - immediate feedback (as if the user) - surfaces bugs / gaps / issues, which it can typically address if they're minor... or I have it "log" to the TODO
so then it's real error > fix / log to TODO > new session picks up the TODO entry
so that's effectively loop #3 - the codebase becomes a working log of issues, that another session is basically "looped" over the TODO, to investigate root cause, propose fix, then live test
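a minimal sketch of that TODO loop, assuming the TODO is just a list of entries with a status. `TodoEntry`, `next_open` and `close_out` are hypothetical names, and the real thing presumably lives in a file in the repo:

```python
from dataclasses import dataclass, field


@dataclass
class TodoEntry:
    title: str
    status: str = "open"              # "open" until a live test confirms the fix
    notes: list = field(default_factory=list)


def next_open(todos):
    """Return the first entry a new session should pick up, or None."""
    for entry in todos:
        if entry.status == "open":
            return entry
    return None


def close_out(entry, live_test_passed, note):
    """Housekeeping step: only a passing live test moves the entry to verified."""
    entry.notes.append(note)
    entry.status = "verified" if live_test_passed else "open"
    return entry


todos = [TodoEntry("sheet export drops header row"), TodoEntry("chart legend overlaps")]
entry = next_open(todos)
close_out(entry, live_test_passed=True, note="repro no longer fails")
```

the point of the status field is the "must LIVE test to pass" rule: a fix without a passing live test stays open for the next session.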
then I'm putting together some "QA loops" - so a separate session - again, drives the functionality as if a user (i.e. like an analyst) - to test it out, to again surface frictions, or visual gaps, etc.
then it fixes it as it goes... and continues, but this time from more a judgement or UX type of approach - loop #4
the "loop" itself, I find works better, when I'm at first part driving / part monitoring the agent (I find codex better for loops) - because what I'll catch starts to become the hard rules or boundaries for the loop
like SKILLS are judgement, CODE for deterministic, and you must LIVE test to pass, etc. - see these become the foundation for a successful loop
so I can ultimately see that (assuming these loops start to be coherent) - I stay more and more in the "design / architect" zone... more in the planning areas
and let the AI / workflow handle itself with these key loops + feedback
oh and literally right now I have a codex loop (which itself is following a plan I put together, which includes things like validate, use sub-agent reviewers)
- basically to put together a "positioning" indicator - bottom-up - so looking at individual fund portfolios, mapping to benchmark, assessing the relative overweight / underweight - so we get a "global" underweight / overweight PER stock
well that itself is a /goal that's functioning as a loop because it has validation gates, subreviewers
and then I also set up a separate claude session - to monitor the progress and give a review, and then pass messages via agent-mail or with a .json doc
and so now there's two concurrent loops working together
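a sketch of the .json hand-off between the two loops, assuming a single shared file that the reviewer appends to and the implementer drains. the file layout and the `post_review` / `read_unread` names are my own assumptions here, not agent-mail's format:

```python
import json
from pathlib import Path


def post_review(path, slice_id, verdict, findings):
    """Reviewer side: append one review message to the shared file."""
    path = Path(path)
    messages = json.loads(path.read_text()) if path.exists() else []
    messages.append(
        {"slice": slice_id, "verdict": verdict, "findings": findings, "read": False}
    )
    path.write_text(json.dumps(messages, indent=2))


def read_unread(path):
    """Implementer side: return unread reviews and mark them as read."""
    path = Path(path)
    if not path.exists():
        return []
    messages = json.loads(path.read_text())
    unread = [dict(m) for m in messages if not m["read"]]
    for m in messages:
        m["read"] = True
    path.write_text(json.dumps(messages, indent=2))
    return unread
```

no locking here, so it only holds up if the two sessions don't write at the same instant.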
@John_Hempton you gotta start having it "document" things as it goes
then other sessions can pick up the thread, and the file as truth prevents drift in its session too
been experimenting with this new claude + codex workflow... still literally "in-flight"
so I have quite a few "methodology" skills that reflect an equity research workflow:
like identify business model, key driver identification, valuation analysis..., etc.
using MCP tools (data inputs) with a step-by-step methodology, the skills (with agent) produce the judgement
but in order to fully capture that into a workflow that ultimately passes through to a real financial model
there's a lot of "deterministic" pieces, like setting a KPI as a "driver", picking the assumptions, all with a decision-log, citation / evidence tracking, versioning, etc..
of course, the LLM ultimately gets overwhelmed when it has to produce all these "typed" outputs as text, and then there's the pain of "capturing" them back into code + model.
so I had to build the "deterministic" CLI type of package to take the agent judgement > transform to code again.
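roughly what I mean by the deterministic side, as a sketch: the agent hands over a small typed decision, and code does the validation, versioning and logging. `Decision`, `DecisionLog` and the field names are illustrative assumptions, not the actual package:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Decision:
    kpi: str            # e.g. the KPI being set as a driver
    role: str           # e.g. "driver"
    assumption: float   # the number the agent picked
    evidence: str       # citation backing the judgement


class DecisionLog:
    """Append-only log: every recorded decision gets the next version number."""

    def __init__(self):
        self._entries = []

    def record(self, decision):
        if not decision.evidence.strip():
            raise ValueError("every decision needs a citation")
        entry = {"version": len(self._entries) + 1, **asdict(decision)}
        self._entries.append(entry)
        return entry

    def latest(self, kpi):
        for entry in reversed(self._entries):
            if entry["kpi"] == kpi:
                return entry
        return None


log = DecisionLog()
log.record(Decision("units_sold", "driver", 0.05, "example-source"))
log.record(Decision("units_sold", "driver", 0.04, "example-source-2"))
print(log.latest("units_sold")["version"])  # 2
```

the LLM only has to fill four fields; everything else is code, so it can't drift.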
right now, I have a /goal in one codex session that's doing the migration - but with a QA type of approach: it identifies the parts of the skill to transform into "code steps" per the new CLI / functions, one skill at a time, with a "playbook" as guide... so it's a loop
then I also have a few claude sessions also "building / designing" the missing infra pieces (say, we missed a cutover gap of YAML outputs - like decision verdict, etc.)
so it builds the infra, does a skill migration, then updates the playbook...
all while the loop is running - so now the development is more like "build the design / foundation," update the workflow (i.e. the playbook), then the /goal loop starts to implement in real-time
all while another claude monitors progress
similarly, I'm building a "positioning-indicator" which involves a lot of bottom-up build: taking each fund's weights, comparing them to the stated benchmark (which itself requires an LLM), all of which needs to be validated, tested, etc. each step of the way
again, another /goal - that's based on a workflow plan that involves subagent reviews for code / methodology as it's being built
then another "reviewer" - a claude session, which itself is on a "loop" to monitor updated slices and write up a review, which the other "loop" is directed to read
so two concurrent loops of implementer + reviewer, outside reviewer - both working towards a clear goal (i.e. a positioning indicator for each stock - are active funds overweight or underweight)
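the core arithmetic of the positioning indicator, as a sketch: active weight = fund weight minus benchmark weight, then aggregated per stock across funds. the equal-weighted average across funds is my assumption for illustration (the real aggregation could be AUM-weighted or something else), the tickers and weights are made up, and the benchmark mapping is assumed to be already done:

```python
from collections import defaultdict


def active_weights(fund_weights, benchmark_weights):
    """Per-stock overweight (+) / underweight (-) of one fund vs. its benchmark."""
    tickers = set(fund_weights) | set(benchmark_weights)
    return {
        t: fund_weights.get(t, 0.0) - benchmark_weights.get(t, 0.0) for t in tickers
    }


def positioning(funds):
    """funds: list of (fund_weights, benchmark_weights) pairs.

    Returns the average active weight per stock across all funds.
    A stock a fund doesn't hold counts as a full underweight vs. its benchmark.
    """
    totals = defaultdict(float)
    for fund_weights, benchmark_weights in funds:
        for ticker, active in active_weights(fund_weights, benchmark_weights).items():
            totals[ticker] += active
    return {ticker: total / len(funds) for ticker, total in totals.items()}


benchmark = {"AAA": 0.06, "BBB": 0.08}
funds = [
    ({"AAA": 0.10, "BBB": 0.05}, benchmark),
    ({"AAA": 0.08}, benchmark),
]
result = positioning(funds)  # AAA comes out overweight, BBB underweight
```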
literally doing this in real-time...
interesting way to do things!
"Hey guys what if we just like slowed this stuff down a bit huh wouldn't that be nice" - company that shipped a new feature every day for 3 months straight but is now strangely quiet and kinda flubbed the last model update...
does anyone else absolutely HATE tmux but can't operate WITHOUT IT...
especially the copying and pasting?!
I swear every day I do the stupid select-with-cursor to copy, and it gives zero feedback that I actually copied anything, so I just jam ctrl+v over and over
makes me want to punch my monitor and throw my computer out the window
I think I’ve settled on a proper design for cli / mcp
MCP / tools IN (data input)
CLI / code execution OUT (data transformation / mutation)
Then SKILLS (methodology, process) in the middle
Deterministic inputs for the LLM, with context management embedded in the harness
LLM to exercise the judgement per situation (procedure laid out in the skill)
Then CLI / code execution for the OUTPUT - deterministic again
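The shape of that design, as a minimal sketch: three stages wired together, with only the middle one being judgement. `run_step` and the `fake_*` stand-ins are hypothetical; in practice the first would be an MCP tool call, the second an LLM following a skill, and the third a CLI command:

```python
def run_step(fetch, judge, apply, request):
    """Tools IN -> skill / LLM judgement -> CLI OUT, for one request."""
    inputs = fetch(request)        # deterministic data input
    judgement = judge(inputs)      # judgement, following the skill's procedure
    return apply(judgement)        # deterministic transformation / mutation


def fake_fetch(ticker):
    # Stand-in for an MCP tool returning structured data (made-up numbers).
    return {"ticker": ticker, "revenue": [100, 110, 121]}


def fake_judge(inputs):
    # Stand-in for the LLM: a trivial rule instead of real judgement.
    revenue = inputs["revenue"]
    verdict = "growing" if revenue[-1] > revenue[-2] else "shrinking"
    return {"ticker": inputs["ticker"], "verdict": verdict}


def fake_apply(judgement):
    # Stand-in for the CLI: reject anything outside the allowed verdicts.
    if judgement["verdict"] not in {"growing", "shrinking"}:
        raise ValueError(f"unknown verdict: {judgement['verdict']}")
    return f"{judgement['ticker']}: {judgement['verdict']}"


print(run_step(fake_fetch, fake_judge, fake_apply, "AAA"))  # AAA: growing
```

Because the output stage validates what the judgement stage hands it, a malformed verdict fails loudly instead of leaking into the model.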
claude code is theoretically powerful enough to run 1,000s of agents
but I hit a "rate-limit" with 5...
there's some disconnect between how the claude devs use it vs. how it's offered