We evaluated multiple agent setups, including an orchestrated multi-phase agent and off-the-shelf CLI agents, across frontier models.
The headline: scientific reproduction remains hard for today's coding agents.
Check it out here:
🌐 https://t.co/60J04wXySd
Can coding agents actually reproduce scientific findings — not just write code, but rerun the science?
We built AutoMat to find out. 🧪🤖
AutoMat is a benchmark of 85 expert-curated reproduction tasks from computational materials science.