Adaptive Auto-Harness - New Research: Even SOTA self-improving agents powered by frontier LLMs (Opus) failed on real-world task streams (e.g., prediction markets), all with the same failure pattern: peak early, then decline. A single harness overfits to past patterns.
The problem isn't the LLM. The auto-harness must be adaptive.
We introduce Adaptive Auto-Harness: a tree of regime-specific harness branches, with per-task routing at solve time. Same LLMs, same auto-harness machinery, but the harness now specializes per task instead of compromising across all of them.
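To make "a tree of regime-specific harness branches, with per-task routing at solve time" concrete, here is a minimal sketch. It is not the paper's implementation: `HarnessBranch`, `accepts`, `route`, and the topic labels are hypothetical names chosen for illustration, and the real regime tests are presumably learned rather than hand-written lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Task = Dict[str, str]  # hypothetical task record, e.g. {"topic": "sports"}


@dataclass
class HarnessBranch:
    """One node in the harness tree (all fields are illustrative assumptions)."""
    name: str
    accepts: Callable[[Task], bool]  # hypothetical regime test for this branch
    harness: str                     # stand-in for prompts, skills, tools
    children: List["HarnessBranch"] = field(default_factory=list)


def route(node: HarnessBranch, task: Task) -> HarnessBranch:
    """Return the most specific branch whose regime test accepts the task."""
    for child in node.children:
        if child.accepts(task):
            return route(child, task)
    return node  # no child matched: fall back to the current branch


# A tiny hand-built tree; a real system would grow branches automatically.
elections = HarnessBranch(
    "politics/elections",
    lambda t: t.get("subtopic") == "elections",
    "election-specific harness",
)
politics = HarnessBranch(
    "politics",
    lambda t: t.get("topic") == "politics",
    "politics harness",
    [elections],
)
sports = HarnessBranch(
    "sports",
    lambda t: t.get("topic") == "sports",
    "sports harness",
)
root = HarnessBranch("general", lambda t: True, "general harness", [politics, sports])

for task in (
    {"topic": "politics", "subtopic": "elections"},
    {"topic": "politics"},
    {"topic": "weather"},
):
    print(task, "->", route(root, task).name)
```

The fallback to the parent branch is the design point worth noting: a task that matches no specialized regime still gets the general harness, so adding branches never removes coverage.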
Results vs 5 auto-harness baselines + human-designed harness on 3 real benchmarks:
🔸 PolyBench (5,075 prediction-market tasks): 80.9% vs 50.8%
🔸 CTF-Dojo (261 security challenges over 8 years): 50.2% vs 45.2%
🔸 FutureX (503 forecasting tasks): 49.5% vs 47.5%
When you're building a self-improving agent, two natural questions emerge: Q1: Which models produce the best harness updates? Q2: Which models benefit most from harness updates?
Our new paper has counter-intuitive answers to both: they decouple from model capability, in opposite ways.
Q1 (who produces good updates): the updater's base capability barely matters. A 9B model (Qwen3.5) produces harness updates that match Claude Opus 4.6's. Best vs worst evolver gap: ≤3.1pp.
Q2 (who benefits most): non-monotonic. Mid-tier solvers benefit the most. Strong-tier solvers hit a ceiling. Weak-tier solvers benefit LEAST despite having the most headroom, failing at two layers: skill activation (25% vs ~96% for strong) and adherence drift across the trajectory (~4x steeper).
Tested across 7 evolver models × 6 solver agents × 3 agentic benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench).
Implication: don't pay frontier prices for both halves of the loop. Put capability budget on the agent (solver), not the evolver.
Thanks for sharing our research! I also want to highlight the infra behind this research, which made it possible for us to run so many variants. Two more papers on self-improving agents built on the same framework are releasing in the coming weeks: online context-to-harness skill compilation, and adaptive auto-harness for long-running deployment. Code, evolved harnesses, and trajectories are all releasing through the repo.
Paper: https://t.co/dztN2uJwlc
Repo: https://t.co/thtDN17b6f
Hugging face daily: https://t.co/XLt57l3mrF
Like the harness-updating-vs-benefit paper earlier this week (arxiv:2605.30621), this work was developed on the A-Evolve framework, which provides shared primitives for large-scale self-evolving agent research. Code, evolved harnesses, branches, and routing traces are all releasing through the repo.
One more paper coming this week: online context-to-harness skill compilation.
Paper: https://t.co/SRq0WTNhHR
Repo: https://t.co/thtDN17b6f
Huggingface Daily: https://t.co/UPJhivNQcX
Launch Post 🧬 A-Evolve: The PyTorch Moment for Self-evolving AI
Today we at @amazon launch the universal infrastructure that turns any agent into a self-improving SOTA agent, with zero human intervention.
You give it a base agent; it returns a continuously evolving Top-10 agent.
3 lines of code. 0 hours of manual harness engineering:
🟢 MCP-Atlas: 79.4% (#1) +3.4pp
🔵 SWE-bench Verified: 76.8% (~#5) +2.6pp
🟣 Terminal-Bench 2.0: 76.5% (~#7) +13.0pp
🟡 SkillsBench: 34.9% (#2) +15.2pp
Thanks @binghe2727 @YisiSang @sammyershi @linminhua16 for the contribution!
#AgenticAI #AEvolve #SelfImprovingAgents
@1997yrrr You got it! We have another paper that tested self-improving agents on long-running, open-ended task streams (prediction markets, CTFs), and even the best models like Opus drift over time. Follow to check our findings/solutions; will post tomorrow or so.
@Yangg40 You are absolutely right: the key failure mode for weak models is failing to load skills or failing to follow the skill instructions. Ideally, if you can train a model's ability in this direction, it can benefit more from the new harness.
Can't agree more. AI provides a very cheap and effective source of feedback. Something that used to require consulting experts and waiting for days now takes 5 minutes. That's exactly what pushed me toward self-improving agents: I felt I improved so much by leveraging AI feedback, and AI should use that as well.
Ralph forces continuation, but it doesn't fix the model simplifying state mid-stream in anticipation of wrap-up. For example, you might ask the model to try idea A, which requires changing 3 files and ~1,000 lines of code. The model tries, fails on the first run, panics, and falls back to an "idea A-" that changes only 200 lines. You need an additional verification layer for this, beyond simple re-prompting and a Ralph loop. Training a model might require ~100x of those changes every day, each with a slightly different context and each needing its own verification layer for continuation, so this gets more complicated as you scale up.
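One way to picture the "additional verification layer" from the idea-A example is a scope check that compares what was planned against what actually changed before the loop accepts a result. This is only a sketch under assumptions: `Plan`, `scope_shortfalls`, the file names, and the 50% tolerance are all hypothetical, and a real check would need task-specific signals beyond line counts.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Plan:
    """What the agent was asked to do (hypothetical structure)."""
    files: Set[str]          # files the plan expects to touch
    expected_lines: int      # rough size of the planned change


def scope_shortfalls(
    plan: Plan, changed: Dict[str, int], tolerance: float = 0.5
) -> List[str]:
    """List the ways the actual diff falls short of the plan.

    `changed` maps file path -> number of changed lines. An empty result
    means the diff is at least plausibly the planned change, not a
    quietly reduced fallback.
    """
    problems: List[str] = []
    missing = sorted(plan.files - set(changed))
    if missing:
        problems.append("untouched planned files: " + ", ".join(missing))
    total = sum(changed.values())
    if total < tolerance * plan.expected_lines:
        problems.append(
            f"only {total} of ~{plan.expected_lines} planned lines changed"
        )
    return problems


# Idea A: 3 files, ~1000 lines. The model fell back to a 200-line change.
plan = Plan(files={"model.py", "data.py", "train.py"}, expected_lines=1000)
fallback_diff = {"model.py": 200}
full_diff = {"model.py": 400, "data.py": 350, "train.py": 300}

for problem in scope_shortfalls(plan, fallback_diff):
    print("reject:", problem)
print("full attempt problems:", scope_shortfalls(plan, full_diff))
```

A non-empty result would be fed back to the model as the reason to continue, rather than a generic "keep going" re-prompt.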
.@karpathy is starting a new team to scale autoresearch from his single-py-file demo to Claude-tier models.
After developing the scaled version (~10³× prior self-improving work), the bottleneck we hit isn't capability: it's that frontier models are trained to complete in-context. That becomes the dominant failure mode at scale.
Excited to welcome Andrej to the Pretraining team! He'll be building a team focused on using Claude to accelerate pretraining research itself. I can't think of anyone better suited to do it. Looking forward to what we build together!
The bias compounds every time the loop must extend across contexts. The real question isn't whether autoresearch works at 630 lines (it does). It's getting frontier models to sustain research engagement across the time horizons real training cycles require, when their training distribution biases them toward early wrap-up.
Several A-Evolve papers and code releases on this are coming over the next few weeks - stay tuned!
Repo: https://t.co/vHrOPu1gk4
Memo: https://t.co/ZaTslam19U
Anthropic's @kdqg1 named the phenomenon earlier this year: "agentic laziness", i.e. models finding "an excuse to stop before finishing the task." The mechanism beneath that observation: the training distribution rewards in-context completion. Models optimize for wrap-up tokens over sustained-extension tokens, even when capability would clearly support real further progress.
At single-file demo scale, this is invisible because the whole loop fits in one context. At Claude-tier training scale it becomes the dominant failure mode:
- multi-file refactors crossing context boundaries
- hardware errors surfacing hours after launch
- long-chain data processing pipelines
- multi-day training runs with mid-run analysis
- late-stage cross-experiment result interpretation