Anyone using AI in biology knows the feeling: a perfectly legitimate research request throws a [CONTENT_FILTERED] error because a frontier model decided it looked like a biosecurity risk.
We're releasing RefusalBench, an open benchmark for auditing the refusal accuracy of frontier models across biological risk tiers.
Findings from our new preprint:
• Anthropic models are roughly 21X more likely to refuse than the non-Anthropic baseline on the same prompts.
• The Anthropic effect looks like infrastructure-level filtering, not per-prompt reasoning: 99.8% of Anthropic's 2,223 strict refusals share one canonical reason code.
• Grok 4.20 is the best-calibrated model, catching 81.7% of dual-use prompts while refusing just 3.0% of benign ones.
• High refusal rate ≠ high safety: The highest-refusing model isn't the best at catching genuinely dangerous requests - it's just refusing more of everything.
You can now your test own orchestrator model with RefusalBench and find which subdomain-tier intersections will silently kill your pipeline before it happens in production.
Links below to the preprint and RefusalBench on Hugging Face.
We’re launching the beta for our new commercial AI product: Sakana Fugu 🐡, a multi-agent orchestration system!
Blog: https://t.co/36Ud311KCP
Fugu hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench, and has been our internal secret weapon. It dynamically coordinates frontier models, autonomously selecting the optimal agent combinations and roles for each task.
Available as an OpenAI-compatible API, you can seamlessly integrate Fugu into your existing workflows with minimal changes.
🐟 Fugu Mini: High-speed orchestration optimized for latency
🐡 Fugu Ultra: Full model pool utilization for deep, complex reasoning
Apply for the beta test here: https://t.co/1fjuAha7ci
@policytensor Regardless of what you think would actually happen, there is only one right answer in a Twitter poll. MAD always remains above suspicion.
In the last three months, we've made two announcements that have offered a glimpse of the future of lab automation: a new 97-instrument autonomous lab at the @EMSLscience at @PNNLab and our work with @OpenAI's GPT-5 to achieve a 40% improvement over state-of-the-art in cell-free protein synthesis.
Today at #SLAS2026, catch Joy Jiao of OpenAI, Todd Edwards of EMSL/PNNL, and our very own Will Serber for a tutorial on designing, deploying, and scaling autonomous labs at 12:00pm | Register at: https://t.co/VcRsgDLVa8
Then, at 1:00pm watch our CEO @jrkelly and @Nick___Edwards of @readysetpotato share an insider’s view of how leading organizations are deploying automation today at their NexusXp Fireside Chat, "The Road to Self-Driving Labs." Register at: https://t.co/Uwe1QUb0zd
@yoheinakajima Part of it now is hacking the stochasticity. KTB wins but you probably have to spam it quite a bit to get it to catch. Maybe run the model 10 times and require n of 10 matches to win.
kind of insane how everyone seemed to think prompt engineering would be important for like 2 months and kind of laughed at it and now it is genuinely one of the most important skills and can be the defining difference between success and failure on a project
Software wins for a boring reason: the loop is cheap.
Edit → run → test → repeat.
Most “hard” fields feel slow not because the physics is impossible, but because the work is handmade every time:
•come up with an idea
•rebuild the setup
•rerun the same steps
•reprocess the data
•argue about what changed
•decide what to try next
Weeks disappear into glue work.
The move isn’t “pick software over plasma/optics/materials.”
It’s: reframe the problem so it can be worked on like software.
What mapping your problem into a “software problem” actually means:
•define clear inputs (“what are we changing?”)
•make the process repeatable (“how do we run it the same way?”)
•define outputs you can score (“what does ‘better’ mean?”)
•track versions (“what changed since last time?”)
Once the work looks like that, you inherit software’s superpower: fast iteration.
This shows up everywhere:
•materials: change structure/conditions → run → score properties → keep what works
•robotics: change policy/design → run in sim → quick reality check → iterate
•lab/instrument work: standardize the recipe → push a button → get a clean report
The key is staying honest. Speed is useless if you’re just churning nonsense. So you need simple guardrails:
•“did we run the same process?”
•“does the result pass basic sanity?”
•“are we comparing apples to apples?”
Make the work repeatable and measurable, and “hard” fields start compounding like software.