Unfortunately this account has been shadowbanned for 10 days, for no reason. All attempts to request support from X have been ignored. If you can see this and are willing to ping @nikitabier, I would appreciate it. If this account is not restored soon, I will have to leave.
@karpathy @gallabytes It's not obvious how this materially differs from a 'claw'. Can you elaborate on what you think makes this such a giant leap compared to tagging your agent in a Discord or Telegram chat?
While I like omp a lot in concept, I (sadly) still find codex more reliable. Even so, I will keep using omp and proposing improvements to it, since I don't want my model and my harness to be bound together.
We still know remarkably little about how to extract maximum performance from current models. The research on this topic seems extremely inconsistent, or at least difficult to interpret cohesively. Nevertheless, this is very cool.
@sudoingX they're going to work closely with USG to make sure it's cleared after the anthropic debacle - which means it's probably going to be nerfed compared to what it could be, but still an improvement
An observation from doing multiple terminal-bench runs: results can vary significantly even if you change nothing. LLMs just don't work deterministically. Run the same test 5 times with exactly the same configuration and it can fail twice and pass 3 times.
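To put a rough number on that noise, here is a minimal sketch that treats each run as an independent pass/fail trial and computes a Wilson score interval for the true pass rate. The independence assumption and the helper name `wilson_interval` are mine, not part of terminal-bench.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence).

    Assumes every run is an independent trial with the same pass probability.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# The 5-run example from above: 3 passes, 2 failures.
low, high = wilson_interval(passes=3, trials=5)
print(f"observed 60%, plausible pass rate between {low:.0%} and {high:.0%}")
```

With only 5 runs the interval is very wide, so a small gap between two configurations is hard to tell apart from noise unless you repeat the runs many more times.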
Baseline: Pi with GPT-5.5 (medium) scores 70.8% on Terminal Bench 2.1 at a cost of $35.14.
Next: see if we can improve performance by tuning only the system prompt, without increasing cost.
Today's Day 1: Deep-diving into agent harnesses. Let's find out what it takes to squeeze max performance from frontier models.
Goal: Best quality + speed at the lowest token cost.
Inspired by @usr_bin_roygbiv’s cheerleading, I’m testing with the Pi harness and optimising 3 core dimensions:
1. Context management.
2. Tools.
3. Control logic (loops, workflows, determinism).
The loop is simple:
• Run baseline benchmarks in Pi
• Generate a narrow hypothesis on one dimension
• Test and measure
• Iterate
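A rough sketch of that loop in code, under loud assumptions: `run_benchmark` is a hypothetical callable standing in for a real Pi benchmark run, `fake_benchmark` returns made-up numbers purely so the example executes, and the accept rule (higher score at no extra cost) is one possible choice rather than the actual setup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    score: float  # fraction of tasks passed, 0..1
    cost: float   # dollars spent on the run


def iterate(
    baseline_config: dict,
    hypotheses: list[tuple[str, dict]],
    run_benchmark: Callable[[dict], Result],
) -> tuple[dict, Result]:
    """Try one narrow change at a time and keep it only if it helps."""
    best_config, best = baseline_config, run_benchmark(baseline_config)
    for name, change in hypotheses:
        candidate = {**best_config, **change}
        result = run_benchmark(candidate)
        kept = result.score > best.score and result.cost <= best.cost
        print(f"{name}: score={result.score:.3f} cost=${result.cost:.2f} kept={kept}")
        if kept:
            best_config, best = candidate, result
    return best_config, best


def fake_benchmark(config: dict) -> Result:
    # Made-up numbers for illustration only, not real benchmark results.
    score = 0.70 + (0.02 if config.get("system_prompt") == "v2" else 0.0)
    cost = 35.0 + (5.0 if config.get("extra_tools") else 0.0)
    return Result(score, cost)


config, result = iterate(
    {"system_prompt": "v1", "extra_tools": False},
    [
        ("context: tighter system prompt", {"system_prompt": "v2"}),
        ("tools: add extra tools", {"extra_tools": True}),
    ],
    fake_benchmark,
)
```

Since scores move between identical runs, a real version would want several runs per configuration before accepting or rejecting a change.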
Hoping to ship a stronger harness + learn a ton along the way. Follow along for updates and results!
What harnesses are you running that I should try to learn from? Drop your suggestions below 👇
Next up is ID verification for AI models btw
"...suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees."