🔥 AutoResearchClaw tech report + v0.5.0 just dropped.
12,300+⭐ on GitHub. Two big additions this release:
🧪 1/ Domain-Expert Agents in the experiment stage: Specialized agents for high-energy physics, biology, and more. Real domain tools + knowledge plugged in — not a generic LLM pretending to run experiments.
📊 2/ ARC-Bench
A 55-topic benchmark across ML, HEP, quantum physics, biology, and statistics. One of the broadest cross-disciplinary evaluations for autonomous research ever released.
🏆 The numbers:
→ Beats AI Scientist v2 by 54.7% on ARC-Bench
→ 7-mode HITL (human-in-the-loop) ablation: targeted intervention > full autonomy OR exhaustive oversight.
The thesis (still): real research isn't a pipeline. Hypotheses fail. Lessons compound. AutoResearchClaw is a research amplifier — not a paper generator.
📄 Tech report: https://t.co/e5FrGJLzrD
💻 Code: https://t.co/KLOcnzFYaD
Thanks @itsJiaqiLiu and @StephenQS0710 who lead the work and all other contributors @HaonianJi, @lillianwei423, @XinyeYee, @richardxp888, @HaoqinT, @Xinyu2ML, @WeitongZhang, @jiahengzhang96, @LINJIEFUN, @linjunz_stat, @yuyinzhou_cs, @CaimingXiong, @james_y_zou, @ZhengBerkeley, @cihangxie, @dingmyu
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves.
🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat.
🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench.
🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package.
📄 Paper: https://t.co/BWCXebWhG1
💻 Code: https://t.co/hhdgvVjblP
Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie
🔥New paper: Omni-SimpleMem
🧠Multimodal lifelong memory for AI agents — text, image, audio & video.
📈Results:
🏆LoCoMo F1: +57% over Mem0 / Claude-Mem
🏆Mem-Gallery F1: +165% over Mem0 / Claude-Mem
⚡ 3.5x faster retrieval
🔬 How it was built: AutoResearchClaw's Human-in-the-Loop Co-Pilot mode:
🧑🔬 Humans set the research direction
🤖 AI agents ran ~50 experiments autonomously
🐛 Found bugs worth +175% F1
🏗️ Redesigned architecture
✍️ Optimized prompts humans missed
📄Paper: https://t.co/B7JlpgcUTp
💻Code: https://t.co/9bWRtZxsKH
Led by @JiaqiLiu835914, and nice work w/ @YanqingLiu83931, @StephenQS0710, @lillianwei423, @richardxp888, @HaoqinT, @ZhengBerkeley, @cihangxie, @dingmyu
🚀 Introducing AutoHarness (「Aha」) — automated harness engineering for AI agents.
In LLM training, the aha moment is when a model learns to reason. For agents, it's when a better harness makes the same model shine.
Agent = Model + Harness. The model reasons. The harness does everything else:
🧠 Context management
🛡️ Tool governance
💰 Cost control
👁️ Observability
💾 Session persistence
These are the patterns that separate a toy from a system. AutoHarness automates this entire layer.
🔧 What's inside:
- 6-step tool pipeline: parse → classify → permit → execute → sanitize → audit
- 3 modes (Core / Standard / Enhanced) — from lightweight to full-featured
- Smart context management with token budgeting and multi-layer compression
- Full observability: per-call cost tracking, JSONL audit trail, trace diagnostics
- Multi-agent profiles with role-based permissions
- Any LLM provider
Every agent deserves its aha moment.
Led by @JiaqiLiu835914, and Kudos to the team @XinyeYee, @richardxp888, @lillianwei423, @HaoqinT@Xinyu2ML, @yuyinzhou_cs, @ZhengBerkeley, @dingmyu, @cihangxie, etc.