Are LLMs ready for real scientific discovery? To find out, we gathered 50+ scientists from 20+ institutions to establish a multi-level evaluation framework: not only on questions, but also on research scenarios and projects.
Current science benchmarks (like GPQA and MMMU) ask AI to answer quizzes. But science isn't a quiz. It’s an iterative loop of hypothesis, experiment, and analysis. Mastery of static, decontextualized questions, even if perfect, does not guarantee readiness for discovery, just as earning straight A’s in coursework does not make a great researcher.
Today, we introduce Scientific Discovery Evaluation (SDE): a benchmark grounded in real-world research projects. In SDE, research projects are decomposed into modular research scenarios, from which vetted questions are sampled. LLMs are evaluated at two levels:
1. Question-level: targeted, expert-written problems embedded in real research scenarios (elucidating structure from NMR, forward reaction prediction, etc.), NOT sub-domains (analytical chemistry, inorganic materials, etc.)
2. Project-level: realistic scientific discovery loops (e.g., molecular design, materials discovery, protein engineering) where models must iteratively propose, test, and refine hypotheses.
Through the joint effort of 50+ scientists from 20+ institutions, we gathered 8 projects, 43 research scenarios, and 1,125 questions. Evaluating at these multiple levels reveals where current models succeed, where they fail, and why.
It is a great joy to work with a team of 50+ authors for the first time in my life. Thanks to you all for making it happen: @hello_jocelynlu, @YuanqiD, @BotaoYu24, @HowieH36226, @rogerluorl18, @YuanhaoQ, @YinkaiW, @Haorui_Wang123, @JeffGuo__, @SherryLixueC, @MengdiWang10, @lecong, @ParshinShojaee, @KexinHuang5, @chandankreddy, @realadityanandy, @pschwllr, @KulikGroup, @hhsun1, @MoosaviSMohamad, and many others who are not in the X-universe.
It’s also exciting to see a concurrent release from @OpenAI on FrontierScience yesterday (@MilesKWang)! Their findings on the need for harder, expert-vetted evals, especially the huge performance gap between Olympiad and research questions, echo ours. SDE takes this a step further by moving beyond expert-level Q&A to explicitly evaluate the end-to-end discovery loop with project-level execution, which makes finer-grained observations possible.
Core Findings Below:
UCLA associate prof. Samanvaya Srivastava (@uclaengineering @cnsiatucla) is co-leading a $7.5M @NSF-funded initiative to revolutionize sustainable chemical manufacturing, enabling cleaner, scalable production of high-value chemicals 🧪
https://t.co/zGEAQnS7Pg
Check out new work by Deb in collaboration with Martin, Margaret Gardel, Aleks Walczak and Thierry Mora: https://t.co/uowzvcOqST on how simple mechanosensitive agents can enable learning mechanisms.
Check out new work by Jordan in collaboration with Aaron Dinner and @PVlahovska! https://t.co/bvjvTo4Dft We consider minimal protocells under non-equilibrium growth conditions and extract low-dimensional (2D) rules to describe their shapes.
Next week’s entry in the @AI_and_Science Schmidt Fellows Speaker Series features Suri Vaikuntanathan (@suri_lab), Professor at @UChiChemistry!
Join us on Tuesday, February 4th! https://t.co/JJJsgyiTdU
Ligand Many-Body Expansion as a General Approach for Accelerating Transition Metal Complex Discovery
https://t.co/vEfWZprDbG
@realadityanandy @KulikGroup #JCIM Vol. 64, Issue 24 #compchem
We’ve just finished writing the missing 15,616 Wikipedia articles to get complete coverage of all 19,255 human genes. We used PaperQA2, which has higher accuracy than existing human-written Wikipedia articles, as judged by blinded biology PhD students and postdocs. 1/5
Apply for the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship and join a cohort of scholars seeking to advance and accelerate the adoption of artificial intelligence (AI) in the natural sciences and engineering! https://t.co/A2aGcDFvTZ
Thrilled to receive the @NIH Director’s New Innovator Award!🚀 This launches a new research area for us to study dynamic processes at electrified interfaces in biology (e.g., ⚡️🧠 during an action potential) using our novel cryo-EM tools developed for #batteries. #NIHHighRisk