Applications are now live!
Cohort 0 starts March 13th in Presidio with OpenHands, OpenRouter, alphaXiv, Fireworks, Dedalus Labs, Franklin Templeton, Founders Fund and Pantera.
→ $25K+ in prizes
→ 3 weeks building state-of-the-art AI agents
→ Many more surprises
Apply below 👇
Harbor integration is live with EvoSkill v1.2.0
Harbor is a framework for evaluating AI agents against containerized benchmark tasks. It gives EvoSkill access to evolve agents against a registry of 190+ datasets — including benchmarks like SWE-bench Verified, Terminal-Bench 2.0, and Aider Polyglot.
Here’s what it means for automated agent evolution ↓
Harbor integration is live in EvoSkill v1.2.0!
Evolve your agents against a registry of 190+ datasets, including benchmarks like SWE-bench Verified and Terminal-Bench 2.0.
Available on GitHub: https://t.co/gYc4tTQ7pK
skillmaxxing era unlocked.
we built EvoSkill v1 — open-source toolkit that lets AI agents evolve their own skills from failure traces
just give it a benchmark + scoring function and let it cook. voilà.
Introducing EvoSkill V1: an open-source toolkit that evolves any coding agent into a state-of-the-art specialist in minutes.
V1 acts as autoresearch for AI agent skills. Just plug in a benchmark, a ground-truth table (or an LLM judge rubric), and a coding agent, and it evolves the agent against that benchmark.
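To make the "benchmark + scoring function" idea concrete, here is a minimal sketch in plain Python. It is not EvoSkill's API: every name in it (`Agent`, `exact_match_score`, `select_best`, the toy agents and the two-row ground-truth table) is hypothetical and invented for illustration. It only shows the general shape of scoring candidate agents against a ground-truth table and keeping the best one.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: an "agent" is any function from a question to an answer.
Agent = Callable[[str], str]

# Hypothetical two-row ground-truth table (question -> expected answer).
GROUND_TRUTH: Dict[str, str] = {
    "2 + 2": "4",
    "capital of France": "Paris",
}


def exact_match_score(agent: Agent, ground_truth: Dict[str, str]) -> float:
    """Fraction of questions answered exactly right (case-insensitive)."""
    if not ground_truth:
        return 0.0
    correct = sum(
        agent(question).strip().lower() == answer.strip().lower()
        for question, answer in ground_truth.items()
    )
    return correct / len(ground_truth)


def select_best(candidates: List[Agent], ground_truth: Dict[str, str]) -> Agent:
    """Keep the candidate with the highest score on the benchmark."""
    return max(candidates, key=lambda agent: exact_match_score(agent, ground_truth))


def baseline_agent(question: str) -> str:
    # Toy agent that always gives the same answer.
    return "4"


def improved_agent(question: str) -> str:
    # Toy "evolved" agent that handles one more question type.
    return "Paris" if "capital" in question else "4"


if __name__ == "__main__":
    for agent in (baseline_agent, improved_agent):
        print(agent.__name__, exact_match_score(agent, GROUND_TRUTH))
    print("selected:", select_best([baseline_agent, improved_agent], GROUND_TRUTH).__name__)
```

A real setup would swap the exact-match check for whatever scoring the benchmark calls for, such as an LLM judge rubric, and would generate new candidates from failure traces rather than hand-writing them.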
This is the first production drop from Sentient Labs' AI evolution research, where we're exploring how to make AI self-improve across prompts, skills, memory, and the agent harness itself.
Read more to start evolving ↓
Pi Day is here and it’s open to all builders!
From 1 PM to 8 PM PT we’ll run an open program on grounded reasoning and AI evolution, with talks, discussions, and hands-on building.
Join us in Presidio, San Francisco, for the Arena’s Opening Day.
Check the full list of events 🧵
We are excited to welcome Arena’s Cohort 0.
We’ll be joined by top-tier builders, researchers, and operators from across the ecosystem who will first face Challenge 0: Grounded Reasoning over Large Corpora.
Our objective is to document Cohort 0 findings and open-source them.
We’re aligned with efforts like GEPA’s Labs and Karpathy's autoresearch, which have shown that open-source research compounds faster, and we are happy to provide a platform to advance open-source AI research and development.
Looking forward to what Cohort 0 can come up with!
ByteDance’s Doubao Phone Assistant, an AI that executes real tasks across apps, launched in December but never found traction. OpenClaw, its open-source successor, captured China in under 100 days.
The foundational difference? @openclaw's API routing aligns incentives across Chinese AI enterprises without centralizing data.
No single company controls the stack; the data stays local with users.
That’s the open source AI vision we’re building at @SentientAGI.
Challenge 0 for Sentient's Arena is set: “Grounded Reasoning over Large Corpora.”
Economically viable AI solutions in high demand among developers and enterprises center on grounded reasoning: the ability to parse, extract, and compute over large bodies of data.
From a technical perspective, grounded reasoning is the composition of several failure-prone subsystems: perception, retrieval, ranking, disambiguation, numerical or symbolic computation, and final answer synthesis. Each step can be locally plausible and still lead to the wrong answer.
That is why this is still a frontier problem.
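A quick back-of-the-envelope sketch shows why chained subsystems are so fragile. The per-stage accuracies below are hypothetical, illustrative numbers (not measurements from any benchmark), and the calculation assumes stage failures are independent, which real pipelines may not satisfy.

```python
from math import prod

# Hypothetical per-stage success rates, chosen only for illustration.
STAGE_ACCURACY = {
    "perception": 0.95,
    "retrieval": 0.95,
    "ranking": 0.95,
    "disambiguation": 0.95,
    "computation": 0.95,
    "answer_synthesis": 0.95,
}


def end_to_end_accuracy(stage_accuracy: dict) -> float:
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_accuracy.values())


if __name__ == "__main__":
    overall = end_to_end_accuracy(STAGE_ACCURACY)
    print(f"end-to-end accuracy: {overall:.3f}")
```

Under these assumptions, six stages that are each right 95% of the time yield a correct final answer only about 74% of the time.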
Frontier models now perform well on many abstract reasoning tests, but grounded tasks remain far from solved.
On OfficeQA, Databricks reports that even the best parsed-page setup only achieves ~70%. SealQA is also far from solved, with GPT-5 failing to pass ~45%.
In fact, many other top benchmarks are actually grounded reasoning benchmarks: BrowseComp, GAIA, APEX-Agents, Fin-RATE, DABstep, and more.
Identifying the strongest solutions to such problems lets us study reasoning traces that can teach the next generation of AI models to beat similar tasks more easily. In the same vein, abstracting good solutions into skills helps us build a library of agentic capabilities in the interim.
We are excited to meet Cohort 0 in just a few days to work on this problem together and explore how it relates to their startups, or how their work with us can help them launch new businesses.