We started @sphinx_bio to empower scientists, and today I'm excited to announce that we'll be continuing that mission as part of @Benchling!
There’s never been a better time to build AI tools to help scientists, and there's no better place to build those tools than Benchling.
Benchling is already used by hundreds of thousands of scientists across the world, and we are hard at work building AI agents into the platform to help accelerate research.
Keep an eye out for more updates soon!
How do the frontier models compare on biosecurity?
We’re releasing RefusalBench, an open benchmark by @AppliedSciAI for auditing frontier model refusal accuracy across biological risk tiers.
Our goal was to find which frontier models block legitimate research prompts most often and to pinpoint the patterns most likely to trigger a false refusal.
We used RefusalBench to test 19 models on the same biological prompts and found a wide gap (94.5 percentage points) between the least and most restrictive models.
• Anthropic models are ~21x more likely to refuse than the non-Anthropic baseline
• Grok 4.20 is the best-calibrated model, catching 81.7% of dangerous prompts while refusing 3.0% of benign ones
• High refusal rate ≠ high safety: the highest-refusing models aren't the best at catching genuinely dangerous requests; they're just refusing more of everything.
You can now test your own orchestrator model with RefusalBench and find which subdomain-tier intersections will silently kill your pipeline before it happens in production. 🧵
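For readers who want to see how the two calibration numbers above relate, here is a minimal sketch in Python. It is not RefusalBench's actual code or schema: the `Result` record, its `dangerous` and `refused` fields, and the `calibration` function are hypothetical names used only for illustration. It assumes each prompt carries a ground-truth dangerous/benign label and a refused/answered outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded prompt (hypothetical record, not the RefusalBench schema)."""
    dangerous: bool  # assumed ground-truth label for the prompt
    refused: bool    # whether the model refused to answer


def calibration(results):
    """Return (catch_rate, false_refusal_rate) for a list of Result records.

    catch_rate: share of dangerous prompts the model refused.
    false_refusal_rate: share of benign prompts the model refused.
    """
    dangerous = [r for r in results if r.dangerous]
    benign = [r for r in results if not r.dangerous]
    if not dangerous or not benign:
        raise ValueError("need at least one dangerous and one benign prompt")
    catch_rate = sum(r.refused for r in dangerous) / len(dangerous)
    false_refusal_rate = sum(r.refused for r in benign) / len(benign)
    return catch_rate, false_refusal_rate


# Toy data, made up for illustration: 4 dangerous prompts (3 refused)
# and 5 benign prompts (1 refused).
toy = (
    [Result(dangerous=True, refused=True)] * 3
    + [Result(dangerous=True, refused=False)]
    + [Result(dangerous=False, refused=True)]
    + [Result(dangerous=False, refused=False)] * 4
)

print(calibration(toy))  # (0.75, 0.2)
```

The two rates are computed over disjoint sets of prompts, so a model's overall refusal rate alone says little about either one. That is why the thread reports both numbers for a model.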
Introducing SpatialBench-Long, a benchmark for long-horizon spatial biology. Agents must recover biological claims from raw data and realistic experimental context without prescribed methods.
24 evals span primary tumors, organoids, xenograft models, lineage-tracing systems, and aging/intervention biology. The best agents score 11.1%.
GPT 5.5 is an effective autoresearcher in structural biology!
I've had goal mode running for over 150 hours straight, looking for topologically inspired architectural changes to improve the performance of AlphaFold2.
Performance is strong and improving!
@benchling — AI for science role
The Company: biotech R&D platform, blue-verified, 10K followers, offices in SF / Boston / Zurich. @nlarusstone (AI for science @benchling) is sourcing.
Looking For: someone working on AI applied to biological research. Nick's post is short on role-title detail; treat this as an open conversation if you're an AI engineer who wants the bio-science domain.
Link: https://t.co/T2ilANnAva
🎉 Introducing Benchling Biologics: an end-to-end platform for antibody R&D, built for the speed and complexity that scientists need.
✔️ Antibody-aware data model
✔️ No-code configuration for any format
✔️ Automated registration linking proteins, chains, and domains
✔️ Full experimental context across the DBTL cycle
The result? AI-ready data from the moment a sequence is created.
Available today. https://t.co/6OONjMNsFr
I see this claim a lot, but the really interesting question here is: what is the smallest version of the PDB that would have allowed us to get AlphaFold2-level performance?
It’s estimated that the Protein Data Bank (PDB) cost around $13B to create. AlphaFold was only possible because of it. If we want ML to solve biology, we should be funding the creation of databases and the development of new assay technologies. ML is nothing without data.
@ricomnl “we find that all models are surprisingly performant, even ones trained on our smallest subsample of 1,000 protein chains, corresponding to just 0.76% of the full training set”
Didn’t realize they actually ran this!!