What would a project, framework or language need to provide to maximize effectiveness for AI?
I’m thinking: a large corpus of working examples, the ability to establish fast feedback loops via strong types, good testing culture, good linters/formatters, well written user docs and plentiful examples.
My gut says something like https://t.co/SFcL1b5VJX would probably help too.
IME the more guardrails I establish the more effective the LLMs are because there are fewer degrees of freedom for them to get lost in. The faster they can identify when they’ve left the golden path, the better the results.
I made GPT lose its mind in my Codex App.
Its last message (see screenshots) was to generate an image of a Java cheatsheet. I did not ask it to generate an image. I asked it to generate Rust code.
And then it tried to compact over and over and even if I stop and restart or recompact it cannot break out of this loop.
I could watch a whole series of lectures like this.
I find this far more engaging than watching a slide deck. The imperfect nature and the natural pauses allow me to focus more on the material and absorb it, rather than passively watching a polished presentation with picture-perfect slides.
To be fair, if someone is a 0.01x developer, a 100x boost would bring their productivity to that of an average dev.
I actually think this is what is happening in a lot of cases: non-devs or really bad devs are finally producing a “normal” amount of code and are having the time of their lives.
One thing I do is set up a goal in a file, and then I tell Codex to use the goal file, and have Claude set to wake up every N minutes to review the commits and "steer" Codex by updating the goal file with flags it finds. Codex is also told about the arrangement and to re-check the goal file every time it completes a task.
I literally do: /goal /tmp/goal.md
This way I get Codex and Claude working together, with Claude watching Codex like a hawk. I'll also tell Claude to make sure Codex doesn't reward hack, and tell Codex to be wary of Claude too.
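A minimal sketch of the goal-file handshake described above, not the actual setup. It assumes the goal file is plain Markdown and that the reviewer's flags live in a final section; `FLAG_HEADER`, `append_flag`, and `read_flags` are hypothetical names, and how each agent gets invoked is left out entirely.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical section name, not a Codex or Claude convention.
# Assumption: this section is always the last one in the goal file.
FLAG_HEADER = "## Reviewer flags"


def append_flag(goal_path: Path, flag: str) -> None:
    """Reviewer side: append a timestamped flag under the flags section."""
    text = goal_path.read_text() if goal_path.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"
    if FLAG_HEADER not in text:
        text += f"\n{FLAG_HEADER}\n"
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    goal_path.write_text(text + f"- [{stamp}] {flag}\n")


def read_flags(goal_path: Path) -> list[str]:
    """Worker side: re-read the flags after finishing each task."""
    text = goal_path.read_text()
    if FLAG_HEADER not in text:
        return []
    section = text.split(FLAG_HEADER, 1)[1]
    return [line[2:] for line in section.splitlines() if line.startswith("- ")]
```

The reviewer would call `append_flag` after inspecting new commits, and the worker would call `read_flags` whenever it completes a task, so the file is the only channel between the two.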
There is definitely a change in cache usage. Around a week ago my 20x started draining 3-5x faster than before, with no change in prompt or project.
I have been using goals with 5.5 low that drain 20-50% of my weekly quota in 24 hours. These are Rust projects with long build times between inference calls, so it's not like it's hammering OpenAI servers or anything.
There are multiple reports all across X confirming this.
A 155h goal is a lot, but that doesn't mean it was all run consecutively. This is just how it reports things.
@zeeg @ThePrimeagen As I read that book a few times I was like “that's a very logical way to break that problem down and solve each sub problem in parallel”.
I was not surprised at all to learn the author is a programmer.
I'm shocked that the Ruby community hasn't latched onto mutant in the age of LLMs.
There are no better techniques available to Ruby devs to make their tests better and their agents more effective than mutation testing.
Seems like such a massive waste of time and money to not use it.
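For anyone who hasn't seen mutation testing, here is a toy illustration of the idea in Python. This is not how mutant works internally; it is a sketch with a single hand-written mutation operator (`>=` becomes `>`), and `is_adult`, `weak_suite`, and `strong_suite` are made-up names.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGtE(ast.NodeTransformer):
    """Mutation operator: replace every `>=` with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its is_adult function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Never probes the boundary value.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Adds the boundary case that the mutation changes.
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite):
    """True if the suite passes on the original but fails on the mutant."""
    original = ast.parse(SOURCE)
    mutated = ast.fix_missing_locations(FlipGtE().visit(ast.parse(SOURCE)))
    assert suite(load(original)), "suite must pass on unmutated code"
    return not suite(load(mutated))
```

A surviving mutant (`mutant_killed(weak_suite)` is false here) points at behavior the tests never pin down, which is exactly the kind of signal an agent can act on.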
@joshmo_dev Agent harnesses should have the option to track every single event in such a way that you could do a perfect reconstruction of a session and what happened.
IMHO the quality of engineering in a harness should be closer to a database than a todo list app 😁
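A rough sketch of what "track every event so the session can be reconstructed" could look like: an append-only log with sequence numbers, plus a replay function that rebuilds state from the log alone. The event kinds and the `EventLog` and `replay` names are assumptions for illustration; a real harness would also need durable storage and many more event types.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log; each line is one JSON-encoded event."""

    lines: list[str] = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> None:
        record = {"seq": len(self.lines), "kind": kind, "payload": payload}
        self.lines.append(json.dumps(record, sort_keys=True))


def replay(lines: list[str]) -> dict:
    """Rebuild session state purely from the log, refusing to skip gaps."""
    state = {"messages": [], "tool_calls": 0}
    for expected_seq, line in enumerate(lines):
        event = json.loads(line)
        if event["seq"] != expected_seq:
            raise ValueError(f"gap in log at seq {expected_seq}")
        if event["kind"] == "message":
            state["messages"].append(event["payload"])
        elif event["kind"] == "tool_call":
            state["tool_calls"] += 1
    return state
```

The sequence check is the database-flavored part: a reconstruction either matches what happened or fails loudly, rather than silently replaying a partial history.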
@JacobRothfield @doodlestein Haha, well it's a fun challenge, but it's not for a lab.
My harness allows a SOTA model and open weight models to work together. There are a few modes, some where it could be used for distillation, or offloading of some tasks, or best of N, etc.
@JacobRothfield @doodlestein > mitigate LLM failures that are baked in by the training
I know I've seen this but it must be very frequent for you to call this out. What have you seen?
(asking because I am working on a harness right now)
@CWood_sdf Spoiler: All programmers have atrocious memories, and the ones who don't think this applies to them are the worst.
Functional programming was made for people who are honest with themselves about their capabilities.
I would guess they can't move and experiment as fast as a third-party harness.
Also, I don't think RL is as strong of an anchor as people think. The LLM has no idea what harness it's talking to; all it does is produce tokens and tool calls that the harness parses. There's no reason a third party harness couldn't perfectly emulate the interfaces exactly, but chain and combine things in novel ways.
@DavidKPiano @lgrammel I know this somewhat contradicts your statement about “go off and come back when you're done”, but I'm thinking about it more for microtasks, not entire trajectories that add a bunch of context. Just offload the fiddly bits while keeping the mainline context higher level.
Another benefit is that you can hand off tasks like “fix the failing lint or test” to the forked subagent. The iterating required to make it pass doesn't really add to the conversation, so the subagent can just work on it and then return success without cluttering context.
A lot of what we put in the context window doesn't really benefit the LLM on future turns, it just adds noise.
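The offloading idea can be sketched roughly like this. It is a toy, assuming the subagent is just a retry loop with its own scratch buffer; `run_subagent` and `fake_lint_fix` are hypothetical stand-ins for a real model and tool loop.

```python
from typing import Callable


def run_subagent(
    task: str,
    attempt: Callable[[int], tuple[bool, str]],
    max_iters: int = 5,
) -> str:
    """Iterate in a private scratch buffer; return only a one-line summary."""
    scratch: list[str] = []  # noisy output stays here, never in the main context
    for i in range(1, max_iters + 1):
        ok, output = attempt(i)
        scratch.append(output)
        if ok:
            return f"{task}: passed after {i} attempt(s)"
    return f"{task}: still failing after {max_iters} attempts"


def fake_lint_fix(i: int) -> tuple[bool, str]:
    """Stand-in for a real fix-and-rerun step; succeeds on the third try."""
    return (i >= 3, f"attempt {i}: long lint output ...")


main_context = ["user: add the retry feature"]
main_context.append(run_subagent("fix the failing lint", fake_lint_fix))
```

Three rounds of lint output were produced, but the mainline context grew by exactly one line.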