This is the 6th RoboPhD application, alongside @Chudbrochil's recent sudoku work. RoboPhD wins 5 of 6 across the suite — ARC-AGI, Text2SQL, financial QA, sudoku, and now bioinformatics.
Joint work with Anthony and @steve_ash.
I last took biology in high school. RoboPhD just evolved a 682-line agent that scores 65.9% Fmax on Price-149 — a 149-protein benchmark built specifically to defeat homology-based function prediction (the "find a similar protein and copy its labels" approach).
GEPA scored 55.7%. @karpathy's Autoresearch scored 57.7%.
https://t.co/4VKbCQTEiY
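For readers new to the metric: Fmax is usually defined (as in the CAFA function-prediction challenges) as the best F1 over all score thresholds, with precision averaged over proteins that have at least one prediction and recall averaged over all proteins. Here is a minimal sketch under that assumption. The function name `fmax`, the dictionary layout, and the threshold grid are illustrative, and the exact evaluation script used for Price-149 may differ.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax (CAFA-style definition, assumed here).

    predictions: protein id -> {label: confidence score in [0, 1]}
    truth:       protein id -> non-empty set of true labels
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # illustrative grid
    best = 0.0
    for t in thresholds:
        precisions = []   # only proteins with at least one prediction at t
        recall_sum = 0.0  # summed over every protein in the benchmark
        for protein, true_labels in truth.items():
            scored = predictions.get(protein, {})
            predicted = {label for label, s in scored.items() if s >= t}
            hits = len(predicted & true_labels)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_labels)
        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because the threshold is chosen after the fact, an agent is rewarded for well-ordered confidence scores, which is one reason confidence calibration can matter for this metric.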
RoboPhD finds these techniques because the evolution loop sees rich per-protein error reports and gets selection pressure from head-to-head competition between candidate agents. Bad design choices fail visibly on specific proteins; good ones win iteration after iteration.
The seed was 50 lines of BLAST with an LLM fallback (very basic stuff). The evolved agent does multi-source evidence fusion, adversarial dual-LLM ensembling, and confidence calibration by homology consensus.
These are techniques real bioinformatics groups publish papers on.
All three algorithms had the same evaluation budget and were given the same information. None of them benefited from my 9th grade bioinformatics wizardry.
Excited to share RoboPhD! An evolutionary approach to optimizing agents through multi-round competition, ranked with Elo.
https://t.co/D7QDGNXD40
https://t.co/Pyaav6R3Mz
Takeaways:
💡 On three out of four diverse tasks (abstract reasoning, SQL generation, financial QA, cloud scheduling), RoboPhD beats both the popular GEPA and an adaptation of @karpathy's AutoResearch hill-climbing approach under the same fixed number of evaluations.
💡 RoboPhD runs a multi-round competition with a different sample each round and uses Elo to rank candidates. This makes it more sample-efficient over a fixed train/validation split.
💡 RoboPhD lets the agents self-instrument to discover useful diagnostic info and surface it to the evolution process, a kind of self-adapting textual gradient.
The code is out on GitHub under the MIT license, and we offer an API similar to GEPA's optimize_anything to make it easy to plug in your own tasks! "If you can benchmark it, RoboPhD can optimize it" :)
This work was led by the herculean efforts of Andrew Borthwick, with myself and Anthony Galczak contributing.
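The Elo-ranked, multi-round competition from the takeaways above can be sketched roughly like this. It is a toy illustration, not RoboPhD's actual code: the names `run_competition` and `update_elo`, the K-factor of 32, the starting rating of 1500, and the rule that the agent with higher accuracy on a round's sample wins the match are all assumptions made for the example.

```python
import random

def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, a, b, score_a, k=32.0):
    """Update both ratings in place; score_a is 1.0 (win), 0.5 (draw), or 0.0."""
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta  # zero-sum: B loses what A gains

def run_competition(agents, examples, rounds=5, sample_size=4, seed=0):
    """agents: name -> callable(example) -> bool (True if answered correctly)."""
    rng = random.Random(seed)
    names = sorted(agents)
    ratings = {name: 1500.0 for name in names}
    for _ in range(rounds):
        batch = rng.sample(examples, sample_size)  # a different sample each round
        accuracy = {
            n: sum(agents[n](ex) for ex in batch) / len(batch) for n in names
        }
        # Every pair of candidates plays one head-to-head match per round.
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if accuracy[a] > accuracy[b]:
                    score_a = 1.0
                elif accuracy[a] < accuracy[b]:
                    score_a = 0.0
                else:
                    score_a = 0.5
                update_elo(ratings, a, b, score_a)
    return ratings
```

Resampling each round means a candidate has to win on many different subsets to climb the ranking, so a lucky result on one sample counts for less.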
@sir4K_zen @rohanpaul_ai Evolution produces an easily deployable agent: the BIRD referees test it on systems that are unseen to us scientists.
Regarding schema drift: the Python analysis tools inspect the current schema at runtime. So if the schema changes, just re-run the analysis phase.
@helderbuilds @rohanpaul_ai The BIRD benchmark has gold SQL for each question. We execute RoboPhD's queries and score each as correct or incorrect based on whether the result matches. The evolution AI analyzes the errors and produces revised prompts + analysis code for the next generation.
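Execution-based scoring of this kind can be sketched with Python's built-in `sqlite3` module. This is a hedged illustration, not the official BIRD evaluator: the helper names `run_query` and `is_correct` are hypothetical, and treating row order as irrelevant and a failing query as incorrect are assumptions of the sketch.

```python
import sqlite3

def run_query(conn, sql):
    """Run a query; return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows containing None or mixed types still compare safely.
    return sorted(rows, key=repr)

def is_correct(conn, predicted_sql, gold_sql):
    """True if the predicted query returns the same rows as the gold query."""
    gold = run_query(conn, gold_sql)
    predicted = run_query(conn, predicted_sql)
    return predicted is not None and predicted == gold
```

Scoring is all-or-nothing per question: a query that is almost right but returns one extra row counts the same as one that fails to parse.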
RoboPhD shows LLMs, text-generating AI models, can self-improve text-to-SQL by evolving tools and prompts from feedback.
Text-to-SQL is hard because the model must understand a database's tables and columns, then write exact SQL, the language databases use for queries, where tiny mistakes count as total failure.
RoboPhD splits the job into 2 evolving parts, a non-AI code script that writes a database cheat sheet, and instructions that guide the LLM to write SQL from that cheat sheet.
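To give a feel for the non-AI half, here is a minimal sketch of a script that reads a SQLite database at runtime and writes a plain-text cheat sheet. The function name `schema_cheat_sheet` and the output layout are assumptions for illustration. The evolved analysis scripts in the paper are presumably far richer than this.

```python
import sqlite3

def schema_cheat_sheet(conn, sample_rows=2):
    """Describe every user table: columns, types, primary keys, sample rows."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        described = ", ".join(
            f"{c[1]} {c[2]}" + (" PK" if c[5] else "") for c in cols
        )
        lines.append(f"TABLE {table}({described})")
        query = f'SELECT * FROM "{table}" LIMIT {int(sample_rows)}'
        for row in conn.execute(query):
            lines.append(f"  sample: {row!r}")
    return "\n".join(lines)
```

Because the schema is read from the live database rather than hard-coded, re-running the script after a schema change yields an updated cheat sheet without touching the prompt.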
An evolution agent, an AI that rewrites the system, keeps making new versions based on what went wrong, tests them on BIRD, a public set of databases and questions, then picks winners with an Elo score, a chess-style rating for head-to-head results.
From a tiny 70-line starting point, the best evolved agent reaches 73.67% accuracy, and the biggest gains show up on cheaper LLMs that normally lag behind.
That matters because the final output is just a reusable script plus instructions, so a lower-cost model can perform like a pricier one in real deployments.
----
Paper Link – arxiv.org/abs/2601.01126
Paper Title: "RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution"
An internal #MachineLearning challenge has fostered a greater sense of community among the company's scientists, says principal scientist Andrew Borthwick. Learn more about Amazon's two-pizza teams and its decentralized approach to science and engineering. https://t.co/gogwc2xVlZ