Nicholas Larus-Stone

@nlarusstone

AI for science @benchling

Joined November 2011

1.4K Following

3K Followers

1.4K Posts

Pinned Tweet

Nicholas Larus-Stone @nlarusstone

about 1 year ago

We started @sphinx_bio to empower scientists, and today I'm excited to announce that we'll be continuing that mission as part of @Benchling!

There’s never been a better time to build AI tools to help scientists and there's no better place to build those tools than Benchling.

Benchling is already used by hundreds of thousands of scientists across the world and we are hard at work building AI agents into the platform to help accelerate research.

Keep an eye out for more updates soon!

16

107

6

9

12K

Nicholas Larus-Stone @nlarusstone

5 days ago

@lukasweidener Thank you for doing this!

0

2

0

0

402

nlarusstone retweeted

9 days ago

How do the frontier models compare on biosecurity?

We’re releasing RefusalBench, an open benchmark by @AppliedSciAI for auditing frontier model refusal accuracy across biological risk tiers.

Our goal was to test which frontier models block legitimate research prompts the most often and pinpoint the patterns most likely to trigger a false refusal.

We used RefusalBench to test 19 models on the same biological prompts and found a wide gap (94.5 pp) between the least and most restrictive models.

• Anthropic models are ~21X more likely to refuse than the non-Anthropic baseline

• Grok 4.20 is the best-calibrated model - catching 81.7% of dangerous prompts while refusing 3.0% of benign ones

• High refusal rate ≠ high safety - the highest-refusing models aren't the best at catching genuinely dangerous requests - they're just refusing more of everything.

You can now test your own orchestrator model with RefusalBench and find which subdomain-tier intersections will silently kill your pipeline before it happens in production. 🧵

1

19

9

9

6K

Nicholas Larus-Stone @nlarusstone

7 days ago

Craziest part of this to me is that Pi is on par with Codex! Suggests that coding harnesses might not be optimal for science

8 days ago

Introducing SpatialBench-Long, a benchmark for long-horizon spatial biology. Agents must recover biological claims from raw data and realistic experimental context without prescribed methods.

24 evals span primary tumors, organoids, xenograft models, lineage-tracing systems, and aging/intervention biology. The best agents score 11.1%.

1

67

17

40

8K

0

20

1

6

3K


Nicholas Larus-Stone @nlarusstone

11 days ago

nlarusstone's tweet photo. https://t.co/dxHgOKKw8X

0

7

1

2

707

Nicholas Larus-Stone @nlarusstone

12 days ago

@jesse_brodkin Half these companies aren’t AI companies, seems like more people should call Ron

1

1

0

0

283

Nicholas Larus-Stone @nlarusstone

15 days ago

It's incredible how powerful getting multiple teams in the room and forcing them to agree on a data model can be

1

4

0

0

466

Nicholas Larus-Stone @nlarusstone

17 days ago

@srikosuri I seem to remember some Asimov stories with this as the central theme

0

2

0

0

142

Nicholas Larus-Stone @nlarusstone

18 days ago

This is cool! Now imagine this on your own internal data…

18 days ago

GPT 5.5 is an effective autoresearcher in structural biology!

I've had goal mode running for over 150 hours straight, looking for topologically inspired architectural changes to improve the performance of AlphaFold2.

Performance is strong and improving! https://t.co/ipVVVB7OOd

44

1K

134

717

132K

0

2

0

2

973

Nicholas Larus-Stone @nlarusstone

19 days ago

@ruth_hook_ Which part? The design or the pipetting

0

0

0

0

24

Nicholas Larus-Stone @nlarusstone

21 days ago

What people don’t realize about the Mandate of Heaven is that Heaven can withdraw its mandate at any time

0

0

0

0

120

Nicholas Larus-Stone @nlarusstone

23 days ago

@AnthropicAI out of curiosity what is so dangerous about lorem ipsum?? https://t.co/S7M3FJjk0s

1

0

0

0

266

nlarusstone retweeted

23 days ago

@benchling — AI for science role The Company: biotech R&D platform, blue-verified, 10K followers, offices in SF / Boston / Zurich. @nlarusstone (AI for science @benchling) is sourcing. Looking For: someone working on AI applied to biological research. Nick's post is short on role-title detail; treat this as an open conversation if you're an AI engineer who wants the bio-science domain. link here: https://t.co/T2ilANnAva

1

1

1

1

258

nlarusstone retweeted

24 days ago

🎉 Introducing Benchling Biologics: an end-to-end platform for antibody R&D, built for the speed and complexity that scientists need.

✔️ Antibody-aware data model
✔️ No-code configuration for any format
✔️ Automated registration linking proteins, chains, and domains
✔️ Full experimental context across the DBTL cycle

The result? AI-ready data from the moment a sequence is created. Available today. https://t.co/6OONjMNsFr

2

43

9

31

5K

Nicholas Larus-Stone @nlarusstone

24 days ago

@amyxlu @ricomnl Say more about this — how do you quantify this compression?

0

0

0

0

99

Nicholas Larus-Stone @nlarusstone

25 days ago

I see this claim a lot but the really interesting question here is what is the smallest version of the PDB that would have allowed us to get alphafold2 level performance?

Shae McLaughlin

26 days ago

It’s estimated that the Protein Data Bank (PDB) cost around $13B to create. Alphafold was only possible because of it. If we want ML to solve biology, we should be funding the creation of databases and the development of new assay technologies. ML is nothing without data.

40

1K

177

266

157K

7

56

4

14

21K

Nicholas Larus-Stone @nlarusstone

24 days ago

@AnthropicAI out of curiosity what's so dangerous about Lorem Ipsum? https://t.co/RTJ8eiiNl6

0

0

0

0

270

Nicholas Larus-Stone @nlarusstone

24 days ago

@OliH58400344 Shouldn’t we have enough structures though?

0

1

0

0

41

Nicholas Larus-Stone @nlarusstone

25 days ago

What are the other problems in biology that have a huge unstructured piece (e.g. sequences) and a small high quality piece (e.g. structures)?

Nicholas Larus-Stone @nlarusstone

25 days ago

@ricomnl “we find that all models are surprisingly performant, even ones trained on our smallest subsample of 1,000 protein chains, corresponding to just 0.76% of the full training set” Didn’t realize they actually ran this!!

2

26

2

12

8K

4

12

2

12

5K

Nicholas Larus-Stone @nlarusstone

24 days ago

@wmdhn So… @lifeschemistry?

0

1

0

0

41

Nicholas Larus-Stone @nlarusstone

25 days ago

@CalvinMccarter This is the @NOETIK_ai way

0

1

0

0

14

Nicholas Larus-Stone @nlarusstone

25 days ago

@DdelAlamo Yeah I think any sort of high fidelity human measurement can basically fit into this paradigm

0

2

0

0

159
