Xinming Tu @TuXinming - Twitter Profile

Pinned Tweet

2 months ago

1/6 Hot take: Scaffold is Structured Test-Time Scaling. Every agent scaffold you've ever built — sub-agent spawning, file-based context passing, test-then-retry loops — is doing structured test-time scaling. And soon, the scaffold becomes the model, the model becomes the scaffold. We wrote a unified framework to explain why it works. w/ @guanghao_ye Check more detail: https://t.co/9Qq8dV90cY

8

225

26

267

29K

TuXinming retweeted

Alan Murphy @Al_Murphy_

about 2 hours ago

1/ We built the Genomics × AI blog so the genomics + ML community can share work fast and actually discuss it — incremental results, negative results, tutorials — without waiting on a publisher. Posts are live, more landing over the coming weeks: https://t.co/a4YY6avrda

1

6

2

0

196

Xinming Tu

@TuXinming

about 20 hours ago

Very insightful post. A lot of bio foundation model researchers may not have this insight — worth a read

Bauer LeSavage

@bauer_lesavage

1 day ago

The training data market has exploded for LLMs and bio foundation models are next. But biological data is extremely complex and requires a data generation playbook that prioritizes quality over immediate scale. @_DimensionCap Research article live now! https://t.co/3CqSYADLJP

bauer_lesavage's tweet photo. The training data market has exploded for LLMs and bio foundation models are next.

But biological data is extremely complex and requires a data generation playbook that prioritizes quality over immediate scale.

@_DimensionCap Research article live now!

https://t.co/3CqSYADLJP https://t.co/oZt7rsW1OS

8

95

15

81

21K

0

4

0

2

881

TuXinming retweeted

Romain Lopez

@_romain_lopez_

3 days ago

🚀 We are introducing PerturbPair (with @TakaKud0) — a platform that combines parallel Perturb-seq and optical pooled screening (OPS/PerturbView) in primary cells to systematically map at massive scale how genetic perturbations reshape cellular states across modalities. With wonderful collaborators @TakaKud0, @AnaMeireles, @AntRios, @jchuetter, @MinOta, @ORozenblattRosen, @LeviAGarraway, @KGeiger, @avtarsingh, @jkpritch, and Aviv Regev. Paper link: https://t.co/fnSUymW95s

_romain_lopez_'s tweet photo. 🚀 We are introducing PerturbPair (with @TakaKud0) — a platform that combines parallel Perturb-seq and optical pooled screening (OPS/PerturbView) in primary cells to systematically map at massive scale how genetic perturbations reshape cellular states across modalities.
With wonderful collaborators @TakaKud0, @AnaMeireles, @AntRios, @jchuetter, @MinOta, @ORozenblattRosen, @LeviAGarraway, @KGeiger, @avtarsingh, @jkpritch, and Aviv Regev.

Paper link: https://t.co/fnSUymW95s

2

190

38

101

12K

Who to follow

evo-devo

@Xiaojie_Qiu

Assistant Prof. @ Stanford BASE, Genetics & CS (courtesy). Lead the predictive genomics lab. Come to work with us to build the first AI virtual embryo model!

Piotr Nawrot

@p_nawrot

LLM Efficiency @NVIDIA - views have always been only my own 🥇🥈 @ Flunkyball Polish Championships

Jian Ma

@jmuiuc

Ray and Stephanie Lane Professor of Computational Biology at CMU School of Computer Science @CMUCompBio @SCSatCMU @CarnegieMellon

TuXinming retweeted

Peter Koo @pkoo562

4 days ago

The historic 90th CSH Symposium on AI in Biology at @CSHL has come to an end. After a decade of remarkable advances in AI x Bio, aspirations have become bolder than ever. The future looks bright but much work remains to turn hype into real-world progress. Thanks to all speakers!

4

141

18

22

15K

Xinming Tu

@TuXinming

5 days ago

Also shouting out @zhou_jian who did a lot of pioneering work on sequence-to-function models. DeepSEA is one of the earliest papers in this field, and he continues to make great contributions to the space

Anshul Kundaje @anshulkundaje

5 days ago

Just want to give a shout-out to David Kelley @drklly who I think often does not get the credit he deserves (outside our core community). I want to highlight why I think he is such a fantastic scientist and leader in regulatory genomics. 1/

7

264

32

90

46K

1

29

6

7K

TuXinming retweeted

Anshul Kundaje @anshulkundaje

5 days ago

Just want to give a shout-out to David Kelley @drklly who I think often does not get the credit he deserves (outside our core community). I want to highlight why I think he is such a fantastic scientist and leader in regulatory genomics. 1/

7

264

32

90

46K

Xinming Tu

@TuXinming

5 days ago

Huge congratulations on both incredible milestones, Prof @weijie444 A well-deserved promotion to Full Professor, and a great win for OpenAI. Highly anticipating the new breakthrough models to come! 🎉

Weijie Su

@weijie444

6 days ago

Personal update: I've joined OpenAI while on leave from Wharton. After a decade away, glad to be back in the Bay Area and train AI models here! One more thing, I've been promoted to full professor, a decade-long journey made possible by many, especially my students.

weijie444's tweet photo. Personal update: I've joined OpenAI while on leave from Wharton. After a decade away, glad to be back in the Bay Area and train AI models here!

One more thing, I've been promoted to full professor, a decade-long journey made possible by many, especially my students. https://t.co/lSbCABlwlY

95

2K

99

224

626K

0

1

0

1K

TuXinming retweeted

Surya Ganguli

@SuryaGanguli

6 days ago

haha - my slide at the 90th @CSHLaboratory Symposium on AI in biology got a spontaneous applause. It was inspired by Sir Michael Atiyah's quote about algebra and geometry: "Algebra is the offer made by the devil to a mathematician. The devil says, 'I will give you this powerful machine...all you need to do is give up your soul: give up geometry...’ The danger to our soul is there... when you pass into algebraic calculation,...you stop thinking geometrically, you stop thinking about meaning." In the slide below: algebra -> AI mathematician -> biologist geometry -> understanding

4

121

18

43

12K

Xinming Tu

@TuXinming

6 days ago · Cold Spring Harbor

Best slide at the 90th CSHL Symposium "AI in Biology" by @SuryaGanguli Really nice talk on NeuroAI! Thanks @pkoo562 for organizing! @cshlmeetings

TuXinming's tweet photo. Best slide at the 90th CSHL Symposium "AI in Biology" by @SuryaGanguli
Really nice talk on NeuroAI!

Thanks @pkoo562 for organizing! @cshlmeetings https://t.co/ip84eimcyT

0

77

15

28

18K

Xinming Tu

@TuXinming

8 days ago

Try ESMFold2 using natural language! @phylo_bio we take care of everything

Phylo @phylo_bio

8 days ago

ESMC-6B, ESMFold2, and ESMFold2-Fast are now live on Biomni Lab. Through our collaboration with @biohub, researchers can now ask Biomni to: → generate protein embeddings → run zero-shot variant effect prediction → predict structures from sequences → compare variants against structural and disease-relevant context https://t.co/3lOQsjtkla Start using these models on Biomni Lab’s HPC cluster today!

3

105

18

39

13K

0

16

3

2

2K

Xinming Tu

@TuXinming

13 days ago

Great benchmark on a highly important topic! Do you plan to test Codex + GPT-5.5 (XHigh) and Claude Code + Opus 4.7 (Max Thinking) next? @niloofar_mire

Niloofar

@niloofar_mire

13 days ago

🧬New agentic AI-for-science benchmark: SMDD-Bench! Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets. We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely. Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time. 🧵 below

niloofar_mire's tweet photo. 🧬New agentic AI-for-science benchmark: SMDD-Bench!

Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets.

We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely.

Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time.

🧵 below

12

175

27

80

23K

0

9

0

7

2K

Xinming Tu

@TuXinming

13 days ago

Curious what your experience would be with https://t.co/7YsgkndQ91 — give it a try! @boxiangliu

Boxiang Liu

@boxiangliu

14 days ago

I tried the Edison system developed by the first team (futurehouse) and it was barely usable. Perhaps they have improved but it would be a long way from replacing scientists.

12

113

7

46

42K

0

742

Xinming Tu

@TuXinming

16 days ago

@JunMa_AI4Health I am using the papers https://t.co/xrZmWoebFN

1

3

191

Xinming Tu

@TuXinming

16 days ago

Very interesting. Three papers on agentic Science were just published in Nature today. Been building in this exact space for a while now—my PhD thesis title is literally: "From AI for Genomics to Agentic Science." 🧐 More to come. 🚀

TuXinming's tweet photo. Very interesting. Three papers on agentic Science were just published in Nature today.

Been building in this exact space for a while now—my PhD thesis title is literally: "From AI for Genomics to Agentic Science." 🧐

More to come. 🚀 https://t.co/SZGKrHiEGO

2

76

7

33

7K

Xinming Tu

@TuXinming

16 days ago

@lecong @CSHL Looking forward to your talk!

0

1

0

100

Xinming Tu

@TuXinming

17 days ago

AI agents will fundamentally change how we do science—but first, we need to understand and measure their true capabilities. 🧬🤖 So proud to release BiomniBench! We @phylo_bio worked directly with domain experts and original authors to build 100 real-world data analysis tasks from leading journals like Nature, Science, and Cell. We’ve released the full public set (tasks + expert rubrics) on HuggingFace for the community. Super excited for the future of agentic science! 👇

Yuanhao Qu

@YuanhaoQ

17 days ago

𝗖𝗮𝗻 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗯𝗶𝗼𝗺𝗲𝗱𝗶𝗰𝗮𝗹 𝗱𝗮𝘁𝗮 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝘁𝗮𝘀𝗸𝘀 𝗯𝗲𝗵𝗶𝗻𝗱 𝗽𝗮𝗽𝗲𝗿𝘀 𝗶𝗻 𝗡𝗮𝘁𝘂𝗿𝗲, 𝗖𝗲𝗹𝗹, 𝗮𝗻𝗱 𝗦𝗰𝗶𝗲𝗻𝗰𝗲? To find out, we built 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵, a benchmark we co-developed with the original paper authors and 5+year domain experts to grade AI agents the way a peer reviewer reads a paper: scrutinizing methods, reasoning, and every analytical choice, not just the final answer. As the first track of this benchmark, 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵-𝗗𝗮𝘁𝗮𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀 contains 100 data-analysis tasks drawn directly from 21 published studies in Nature, Cell, Science, Nature Medicine, and other leading journals. Each task hands the agent a real dataset and a research question, then scores its full analytical trajectory against an expert-authored rubric. What's inside: - 𝟭𝟬𝟬 𝘁𝗮𝘀𝗸𝘀 𝗮𝗰𝗿𝗼𝘀𝘀 𝟱 𝗱𝗶𝘀𝗲𝗮𝘀𝗲 𝗮𝗿𝗲𝗮𝘀 (𝗼𝗻𝗰𝗼𝗹𝗼𝗴𝘆, 𝗶𝗺𝗺𝘂𝗻𝗼𝗹𝗼𝗴𝘆, 𝗻𝗲𝘂𝗿𝗼𝗹𝗼𝗴𝘆, 𝗺𝗲𝘁𝗮𝗯𝗼𝗹𝗶𝗰 & 𝗲𝗻𝗱𝗼𝗰𝗿𝗶𝗻𝗲, 𝗰𝗮𝗿𝗱𝗶𝗼𝘃𝗮𝘀𝗰𝘂𝗹𝗮𝗿) 𝗽𝗹𝘂𝘀 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗯𝗶𝗼𝗹𝗼𝗴𝘆 - 𝟭𝟳 𝗮𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝗮𝗹 𝘁𝗮𝘀𝗸 𝘁𝘆𝗽𝗲𝘀 (𝗲.𝗴., 𝗚𝗪𝗔𝗦/𝗲𝗤𝗧𝗟 𝗰𝗼𝗹𝗼𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻, 𝗧-𝗰𝗲𝗹𝗹 𝗿𝗲𝗰𝗲𝗽𝘁𝗼𝗿 𝗿𝗲𝗽𝗲𝗿𝘁𝗼𝗶𝗿𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀, 𝗰𝗲𝗹𝗹-𝗰𝗲𝗹𝗹 𝗰𝗼𝗺𝗺𝘂𝗻𝗶𝗰𝗮𝘁𝗶𝗼𝗻) - 𝗔𝗻 𝗲𝘅𝗽𝗲𝗿𝘁-𝗰𝘂𝗿𝗮𝘁𝗲𝗱 𝗿𝘂𝗯𝗿𝗶𝗰 𝗳𝗼𝗿 𝗲𝘃𝗲𝗿𝘆 𝘁𝗮𝘀𝗸, 𝘀𝗰𝗼𝗿𝗶𝗻𝗴 𝟲 𝗱𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝘀 𝗼𝗳 𝗮𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝗮𝗹 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 - 𝗣𝗿𝗼𝗰𝗲𝘀𝘀-𝗹𝗲𝘃𝗲𝗹 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝟵 𝗳𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗟𝗟𝗠𝘀 (𝗚𝗣𝗧-𝟱.𝟱, 𝗖𝗹𝗮𝘂𝗱𝗲 𝗢𝗽𝘂𝘀 𝟰.𝟳, 𝗮𝗺𝗼𝗻𝗴 𝗼𝘁𝗵𝗲𝗿𝘀) 𝗮𝗰𝗿𝗼𝘀𝘀 𝟰 𝗮𝗴𝗲𝗻𝘁 𝗵𝗮𝗿𝗻𝗲𝘀𝘀𝗲𝘀 (𝗖𝗹𝗮𝘂𝗱𝗲 𝗖𝗼𝗱𝗲, 𝗖𝗼𝗱𝗲𝘅 𝗖𝗟𝗜, 𝗧𝗲𝗿𝗺𝗶𝗻𝘂𝘀-𝟮, 𝗚𝗲𝗺𝗶𝗻𝗶 𝗖𝗟𝗜) Headline results: - 𝗙𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 𝗹𝗲𝗮𝗱 𝗮𝘁 𝟳𝟯.𝟯/𝟭𝟬𝟬, 𝘄𝗶𝘁𝗵 𝘀𝘂𝗯𝘀𝘁𝗮𝗻𝘁𝗶𝗮𝗹 𝗵𝗲𝗮𝗱𝗿𝗼𝗼𝗺 𝘁𝗼 𝗶𝗺𝗽𝗿𝗼𝘃𝗲. - 𝗧𝗵𝗲 𝗮𝗴𝗲𝗻𝘁 𝗵𝗮𝗿𝗻𝗲𝘀𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗮𝘀 𝗺𝘂𝗰𝗵 𝗮𝘀 𝘁𝗵𝗲 𝗯𝗮𝘀𝗲 𝗺𝗼𝗱𝗲𝗹. - 𝗔𝗴𝗲𝗻𝘁𝘀 𝗳𝗮𝗹𝗹 𝘀𝗵𝗼𝗿𝘁 𝗼𝗻 𝗯𝗶𝗼𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗶𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝘁𝗶𝗼𝗻, 𝗺𝗲𝘁𝗵𝗼𝗱 𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝗳𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴. We hope to make 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵 the most helpful benchmark for biologists to understand how AI agents handle real-world biomedical tasks: where they can be trusted, and where they fall short. We're actively expanding our evaluation effort, and would love to engage the broader scientific community on what comes next. 📄 https://t.co/U1rh2QP9Ht 🤗 https://t.co/513AX8AQtJ Thanks to our amazing @phylo_bio team (Minta Lu, @TuXinming , @serena2z , @TianweiShe , @lecong , @jure , @KexinHuang5 ) and our collaborators at @LaudeInstitute , @Stanford , @Harvard , @PKU1898 , @virginia_tech , Humanlaya Data Lab, Xbench: @alexgshaw , JOU-HO SHIH, Bingqing Zhao, Minjie Shen, Haochen Yang, Jielin Yan, Rongchuan Zhang, Xinze Wu, Tingting Li, Xiaobo Hu, Yuan Jiang, Jiayun Dong, Tao Peng.

YuanhaoQ's tweet photo. 𝗖𝗮𝗻 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗯𝗶𝗼𝗺𝗲𝗱𝗶𝗰𝗮𝗹 𝗱𝗮𝘁𝗮 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝘁𝗮𝘀𝗸𝘀 𝗯𝗲𝗵𝗶𝗻𝗱 𝗽𝗮𝗽𝗲𝗿𝘀 𝗶𝗻 𝗡𝗮𝘁𝘂𝗿𝗲, 𝗖𝗲𝗹𝗹, 𝗮𝗻𝗱 𝗦𝗰𝗶𝗲𝗻𝗰𝗲?

To find out, we built 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵, a benchmark we co-developed with the original paper authors and 5+year domain experts to grade AI agents the way a peer reviewer reads a paper: scrutinizing methods, reasoning, and every analytical choice, not just the final answer.

As the first track of this benchmark, 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵-𝗗𝗮𝘁𝗮𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀 contains 100 data-analysis tasks drawn directly from 21 published studies in Nature, Cell, Science, Nature Medicine, and other leading journals. Each task hands the agent a real dataset and a research question, then scores its full analytical trajectory against an expert-authored rubric.

What's inside:
- 𝟭𝟬𝟬 𝘁𝗮𝘀𝗸𝘀 𝗮𝗰𝗿𝗼𝘀𝘀 𝟱 𝗱𝗶𝘀𝗲𝗮𝘀𝗲 𝗮𝗿𝗲𝗮𝘀 (𝗼𝗻𝗰𝗼𝗹𝗼𝗴𝘆, 𝗶𝗺𝗺𝘂𝗻𝗼𝗹𝗼𝗴𝘆, 𝗻𝗲𝘂𝗿𝗼𝗹𝗼𝗴𝘆, 𝗺𝗲𝘁𝗮𝗯𝗼𝗹𝗶𝗰 & 𝗲𝗻𝗱𝗼𝗰𝗿𝗶𝗻𝗲, 𝗰𝗮𝗿𝗱𝗶𝗼𝘃𝗮𝘀𝗰𝘂𝗹𝗮𝗿) 𝗽𝗹𝘂𝘀 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗯𝗶𝗼𝗹𝗼𝗴𝘆
- 𝟭𝟳 𝗮𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝗮𝗹 𝘁𝗮𝘀𝗸 𝘁𝘆𝗽𝗲𝘀 (𝗲.𝗴., 𝗚𝗪𝗔𝗦/𝗲𝗤𝗧𝗟 𝗰𝗼𝗹𝗼𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻, 𝗧-𝗰𝗲𝗹𝗹 𝗿𝗲𝗰𝗲𝗽𝘁𝗼𝗿 𝗿𝗲𝗽𝗲𝗿𝘁𝗼𝗶𝗿𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀, 𝗰𝗲𝗹𝗹-𝗰𝗲𝗹𝗹 𝗰𝗼𝗺𝗺𝘂𝗻𝗶𝗰𝗮𝘁𝗶𝗼𝗻)
- 𝗔𝗻 𝗲𝘅𝗽𝗲𝗿𝘁-𝗰𝘂𝗿𝗮𝘁𝗲𝗱 𝗿𝘂𝗯𝗿𝗶𝗰 𝗳𝗼𝗿 𝗲𝘃𝗲𝗿𝘆 𝘁𝗮𝘀𝗸, 𝘀𝗰𝗼𝗿𝗶𝗻𝗴 𝟲 𝗱𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝘀 𝗼𝗳 𝗮𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝗮𝗹 𝗾𝘂𝗮𝗹𝗶𝘁𝘆
- 𝗣𝗿𝗼𝗰𝗲𝘀𝘀-𝗹𝗲𝘃𝗲𝗹 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝟵 𝗳𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗟𝗟𝗠𝘀 (𝗚𝗣𝗧-𝟱.𝟱, 𝗖𝗹𝗮𝘂𝗱𝗲 𝗢𝗽𝘂𝘀 𝟰.𝟳, 𝗮𝗺𝗼𝗻𝗴 𝗼𝘁𝗵𝗲𝗿𝘀) 𝗮𝗰𝗿𝗼𝘀𝘀 𝟰 𝗮𝗴𝗲𝗻𝘁 𝗵𝗮𝗿𝗻𝗲𝘀𝘀𝗲𝘀 (𝗖𝗹𝗮𝘂𝗱𝗲 𝗖𝗼𝗱𝗲, 𝗖𝗼𝗱𝗲𝘅 𝗖𝗟𝗜, 𝗧𝗲𝗿𝗺𝗶𝗻𝘂𝘀-𝟮, 𝗚𝗲𝗺𝗶𝗻𝗶 𝗖𝗟𝗜)

Headline results:
- 𝗙𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 𝗹𝗲𝗮𝗱 𝗮𝘁 𝟳𝟯.𝟯/𝟭𝟬𝟬, 𝘄𝗶𝘁𝗵 𝘀𝘂𝗯𝘀𝘁𝗮𝗻𝘁𝗶𝗮𝗹 𝗵𝗲𝗮𝗱𝗿𝗼𝗼𝗺 𝘁𝗼 𝗶𝗺𝗽𝗿𝗼𝘃𝗲.
- 𝗧𝗵𝗲 𝗮𝗴𝗲𝗻𝘁 𝗵𝗮𝗿𝗻𝗲𝘀𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗮𝘀 𝗺𝘂𝗰𝗵 𝗮𝘀 𝘁𝗵𝗲 𝗯𝗮𝘀𝗲 𝗺𝗼𝗱𝗲𝗹.
- 𝗔𝗴𝗲𝗻𝘁𝘀 𝗳𝗮𝗹𝗹 𝘀𝗵𝗼𝗿𝘁 𝗼𝗻 𝗯𝗶𝗼𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗶𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝘁𝗶𝗼𝗻, 𝗺𝗲𝘁𝗵𝗼𝗱 𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝗳𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴.

We hope to make 𝗕𝗶𝗼𝗺𝗻𝗶𝗕𝗲𝗻𝗰𝗵 the most helpful benchmark for biologists to understand how AI agents handle real-world biomedical tasks: where they can be trusted, and where they fall short. We're actively expanding our evaluation effort, and would love to engage the broader scientific community on what comes next.

📄 https://t.co/U1rh2QP9Ht
🤗 https://t.co/513AX8AQtJ

Thanks to our amazing @phylo_bio team (Minta Lu, @TuXinming , @serena2z , @TianweiShe , @lecong , @jure , @KexinHuang5 ) and our collaborators at @LaudeInstitute , @Stanford , @Harvard , @PKU1898 , @virginia_tech , Humanlaya Data Lab, Xbench: @alexgshaw , JOU-HO SHIH, Bingqing Zhao, Minjie Shen, Haochen Yang, Jielin Yan, Rongchuan Zhang, Xinze Wu, Tingting Li, Xiaobo Hu, Yuan Jiang, Jiayun Dong, Tao Peng.

16

369

65

95

34K

0

51

1

18

6K

TuXinming retweeted

Phylo @phylo_bio

22 days ago

Until now, AI agent ran on one machine, and one workflow at a time. A memory-heavy step crashed six hours in, and the run started over. Biomni agent now autonomously decides how many machines it needs, how much CPU and memory each task requires, spins them up, and distributes work across them. It operates like a coordinated team working in parallel, rather than one researcher at a work station. This is the foundation for agent-managed infrastructure.

5

51

6

27

10K

TuXinming retweeted

Kilian Lieret @KLieret

23 days ago

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

KLieret's tweet photo. The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵 https://t.co/ukMHUXYs1O

35

1K

120

297

239K

TuXinming retweeted

Yiping Wang

@ypwang61

28 days ago

We improve a 32-year lower bound in a challenging open problem, Ramsey numbers, through simply scaling autoresearch. ⭕ Proves R(3,17) >= 93. Previous 92 bound were obtained in 1994. Google’s AlphaEvolve (2026) matched previous result but did not beat it. All could be done with Claude Code / Codex + a CPU server. Graphs and evolving history are available at https://t.co/2kCsk9Otur [1/n]

11

323

49

113

53K

Xinming Tu

@TuXinming

28 days ago

@ChenhaoTan I ran openaireview locally and it caught some tiny but important mistakes in writing. So your actual user count is higher than the website traffic suggests!

1

0

238

Xinming Tu

@TuXinming

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users