Bogdan A. Zagribelnyy, PhD @sumrexromanus - Twitter Profile

4 months ago

Final countdown ⏳ for the Top-K accuracy metric: it should be forgotten forever. Welcome the 🧪 ChemCensor 🔎 metric for evaluating single-step retrosynthesis models, and LLMs specifically. Read our preprint https://t.co/5cCTnDeQUS and be prepared for new retrosynthesis benchmarks 🚀.

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

@biogerontology

4 months ago

Day 13 of #ScienceAIBench! 🧪 Today we are moving from the biomedical domain to organic chemistry, specifically single-step retrosynthesis (SSRS). We are assessing how well top-tier LLMs can suggest plausible reactions 🔮 to get a compound from plausible reactants. In contrast to the conventional USPTO-50k-test, which relies on the ground-truth-based Top-K accuracy metric, the proposed new benchmark assesses model outputs with a chemical plausibility framework built on the broader chemical context within reaction centers and on functional group compatibility, mimicking the way chemists 🧑‍🔬 review reactions for plausibility. The key data-driven metric for chemical plausibility assessment is ChemCensor, which is part of the URSA (Utilitarian RetroSynthesis Assessment) family of retrosynthesis benchmarks. The brand-new URSA-expert-2026 out-of-distribution (OOD) benchmark set of target molecules is proposed for realistic assessment on real-world medicinal chemistry cases.

📄 Read the ChemCensor for LLMs Preprint: https://t.co/u2Z1UOo5N3

📋 Benchmark Specifications:
· Datasets:
📑 USPTO-50k-test: 4972 target molecules from the conventional USPTO-50k set for SSRS model evaluation
🔥 URSA-expert-2026: 100 novel, synthetically accessible target molecules assessed by experts
· Metric: max ChemCensor value, averaged per target (↑), {Av. PT max CC}
· Metric version: ChemCensor-U2, based on the publicly available USPTO full set by D. Lowe
· Models Evaluated: GPT 5.1, GPT 5.2, Claude Sonnet 4.5, Claude Opus 4.5, DeepSeek 3.2, Gemini 2.5 Flash, Gemini 3 Flash, and Grok 4.1

📊 Observed Performance:
· OOD leader: Gemini 3 Flash achieved the highest {Av. PT max CC} of 1.82, demonstrating superior plausibility of proposed reactions on the OOD URSA-expert-2026 set.
· LLM version progress: newer LLM versions show substantial progress on both the public and OOD sets (GPT 5.2 over 5.1, Gemini 3 over 2.5).
· Performance Gap: No model performs as well on the OOD URSA-expert-2026 benchmark set as it does on the well-reported data. Claude Sonnet 4.5, the best performer on USPTO-50k-test, is only third best on the OOD benchmark.
· Proprietary models win: top-tier proprietary models show much more reliable performance than open-source models (DeepSeek 3.2). Some open-source models (like Kimi K2) were not even included in the chart due to poor (~0) performance.

🔄 Our daily series continues tomorrow.

#ScienceAI #InsilicoBench #MMAI #MMAIGym #DrugDiscovery #Retrosythesis #AIBenchmarks #Biotechnology
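
For readers unfamiliar with the two metrics contrasted in the post, the aggregation behind {Av. PT max CC} (take the best-scoring proposal for each target, then average over targets) can be sketched next to ground-truth-based Top-K accuracy. This is a minimal illustration only, not the preprint's implementation: the function names, the `empty_score` fallback for targets with no proposals, the exact string matching of reactants, and every number in the toy data are assumptions made here. How ChemCensor itself scores a reaction is not shown.

```python
from statistics import mean


def av_pt_max_cc(scores_per_target, empty_score=0.0):
    """Average, over targets, of the best plausibility score per target.

    scores_per_target maps a target ID to the scores of the reactions a
    model proposed for it. empty_score is an ASSUMED fallback for targets
    with no proposals; the post does not say how that case is handled.
    """
    per_target_max = [
        max(scores) if scores else empty_score
        for scores in scores_per_target.values()
    ]
    return mean(per_target_max)


def top_k_accuracy(predictions, ground_truth, k):
    """Share of targets whose recorded reactants appear in the first k predictions.

    Matching is by exact string equality here (an assumption for this sketch).
    """
    hits = sum(
        1
        for target, truth in ground_truth.items()
        if truth in predictions.get(target, [])[:k]
    )
    return hits / len(ground_truth)


# Invented toy scores, not real ChemCensor values.
toy_scores = {
    "target_A": [0.5, 2.0, 1.0],
    "target_B": [1.0],
    "target_C": [],  # the model proposed nothing for this target
}
toy_average = av_pt_max_cc(toy_scores)
```

The structural difference is that Top-K accuracy credits only the single recorded answer, while the per-target maximum credits any proposal that scores well, whether or not it matches the recorded one.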
