@ShanuMathew93 I give Claude a loop with /goal and the goal is to come back with a passing grade from Chat GPT 5.5.
Claude calls Chat GPT over and over with his attempts until he passes, and the /goal makes sure that he can't cheat or stop early.
@morganlinton It's especially helpful cross model but even cross session with same model the agent / critic is extremely underrated.
The issue is models like their own work just like humans do. You need a fresh session that didn't do the work to evaluate.
@morganlinton Would suggest Claude Code terminal and /goal which will have another agent assess it and loop.
I often have Chat GPT 5.5 as a critic of Opus 4.8 work (Opus is smart enough to code this up), and then the 4.8 goalkeeper task is to keep going till Chat GPT approves.
@Gavriel_Cohen@swyx@Barazany Ya..I found it exploring the source too. They have put an enormous effort into maximizing context caching and there is a lot of editing going on to make it happen.
@saurabh_shah2 How aggressively a harness works to minimize token consumption is going to be a major factor.
The is an enormous amount of context editing going on.
For example, the results of tool calls are generally not available to models on subsequent turns and can have a big impact.
@kimmonismus It is only that the release cycles have gotten so fast that very few people can keep up with it.
AI is continuing to diffuse into the workplace, but the average person doesn't have the bandwidth to keep up with what the current state of art is.
@developedbyed I saw something very similar in my tests.
GPT 5.4 is probably the better coder and smarter model, but it is lacking in taste and wants to over achieve on outputs.
@steipete@Cucho The are likely able to optimize cache across sessions ( if everyone is using the same Google harness ) that breaks down once everyone is bringing their own.
@MatthewBerman My guess is that it isn’t OpenClaw/OAuth that gets you banned but rather what OpenClaw does that could get you banned.
This is why Anthropic don’t want to come out and say that OpenClaw is allowed.
Anthropic has low trust in the guardrails.
Somewhat humorous but I think OpenClaw is going to be seen as a marker for the start of the singularity.
We had a language model breakthrough, followed by a reasoning mode breakthrough, and then a recursive AI breakthrough.
Now they can self improve.
@danshipper@every Note, same exact end cost for ARC-AGI tasks, so it could still be cheaper to use Opus.
You are trading more tokens to solve the problem vs more expensive tokens and the cost per token difference is modest.
@Scobleizer@sqs What you are going to need though for professionals is domain experts who think logically and can explain themselves well verbally.
Probably different workers doing this.