Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access and (2) review non-public info about capabilities, alignment, and control.
The result: our first Frontier Risk Report.
Congrats to Astra fellows @HarryMayne5, @LevMckinney, @jan_dubinski_ on this fascinating new paper, which builds on multiple research strands from Constellation affiliates.
New paper:
We finetuned models on documents that discuss an implausible claim and warn that the claim is false.
Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
MASSIVE Congrats to Astra fellow @joemkwon for first-authoring this work!
Super excited to see more strategy stream work get published, as our first cohort from this year wraps up here at @ConstellOrg
I learned more about AI safety at Constellation through seminars, talks, and conversations with other fellows over lunch and dinner than I had in years before.
Also, the food is so good that it alone might be reason enough to apply!
❗️Only two days left to apply to the Astra Fellowship!
Apps close EOD SUNDAY May 3rd, AoE. Astra's 5 months, fully funded, @ConstellOrg Berkeley
80%+ of our first cohort now work full-time in AI safety
Mentors include Redwood, AI Futures, TruthfulAI, CoG, IAPS, RAND & more ⏬
Narrow finetuning on bad data can cause broad misalignment.
Can inoculation prompting or diluting bad data with good prevent this emergent misalignment?
We find such interventions hide misalignment rather than remove it: it reappears when prompts contain cues (sometimes surprising ones) that evoke the bad data.
Really enjoyed working on this with @OwainEvans_UK, @BetleyJan, and @anna_sztyber during the Astra Fellowship at @ConstellOrg!
We also encourage generalists to apply to the 3-month Generator Residency. Applications are due by April 27 for the summer 2026 cohort. https://t.co/pqDLbYqgrx
If you're looking for a high-leverage position to advance AI safety and security, @ConstellOrg is hiring for program/research management, operations, talent, and IT roles: https://t.co/5WCKl2ggYW
In 2017, there were a few dozen people working full time on AI safety. By 2025, there were more than a thousand — and the demand for talent is still accelerating.
We badly need fieldbuilders who can find and develop that talent. A thread:
my team at Coefficient Giving are looking for AI governance grantmaking fellows, via @ConstellOrg's Astra fellowship!
applications close May 3rd, some more details in this thread
https://t.co/Lmi1urjp1P
Announcing the Generator Residency: a 3-month residency for AI safety generalists, by @KairosAIS × @ConstellOrg.
Fully funded. In-person in Berkeley. Summer 2026.
🗓 Apply by April 27
https://t.co/0pM58jFJBP
If you want to work in AI Safety, multi-month research programs like Astra, MATS, etc. are one of the best ways in. Astra's next round just opened, apply now!
Exciting new research from Astra & Anthropic Fellows working out of Constellation: one of the first independent AI safety audits of a new model. Congrats to @yong_zhengxin, @parvmahajan0, and everyone who contributed!
🚨New paper!
How safe and aligned is Kimi K2.5?
We found concerning dual-use capabilities, sabotage and self-replication tendencies, political censorship on Chinese-language queries, and potential agentic misuse risks. (1/N)
🚀 Applications are now open: Constellation's Astra Fellowship 🚀
Fully funded, 5-month fellowship at our Berkeley research institute. Pair with mentors across empirical AI safety research, strategy, and governance at @ConstellOrg!
📅 Apply by May 3rd (begins Sep 2026)
🔗 https://t.co/pxtOduDBFh