turns out a lot of you noticed the same thing.
don't expect an official response, so i had my agents build something:
https://t.co/VypF24GVBy
→ independent benchmarks for codex & claude code every 2 days
→ problems sourced from TerminalBench2 (easy ones filtered out)
→ subscribe (free) to get alerted when something gets nerfed
first batch is already running. scores drop in 2 days.
if this gets 3k subscribers i'll keep it running long-term (benchmarks burn a lot of tokens and that's not free).
RT appreciated. let's keep them honest.
@RussekFilm @HistoryBoomer Nonsense. Unless you're in an expensive city, get all your veggies fresh, and don't use Amazon S&S to buy in bulk, anything above $10/day is a luxury e.g. choosing organic for everything. I eat very well (and almost all organic) and average $8.60/d
@RichardHanania Too easy. Sonnet/Opus 4.5-4.7 and GPT 5.2-5 each have tells that were trivial to spot. The real test is to load up agents with your writing samples and style guides to remove their tells (e.g. Opus using space-emdash-space, GPT being autistic)
@hiarun02 "Codex usage limits are shared with other agentic features. This currently includes Codex and ChatGPT for Excel."
Doesn't seem to include ChatGPT web
@sakpo0007_ Interesting how none of these answers are right (historically). Paul, like Jesus (e.g. Matthew 16:28, Mark 9:1, Luke 9:27), thought the end of the world was coming soon, within their lifetime. Just look at the next sentence: 1 Cor 7:29-31. Also 1 Thess 4:13-18 etc.
Can you try fixing the "Agent terminated due to error" first? That'll help reduce the load from users mashing the retry button. Also, I'd love to use Flash more, but I don't enjoy it asking to run a terminal command just to write broken text to a simple .md in the project folder: "T�h�i�s� �c�o�l�l�e�c�t�i�o�n�..."
@rstormsf All true. You also have to guide it around editing its own configs: Sonnet and K2.5 made malformed edits requiring doctor --fix, and I had to make Codex ssh in once to fix what doctor couldn't
@ryancarson @openclaw Yes, but there are many cases where you shouldn't have the main session run routine ops: heartbeat carries the session context and can be token intensive. I find isolated/subagents better for most scheduled jobs