I work in grantmaking for AI safety and interpretability
Currently: Schmidt Sciences, Stanford
Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill
Can we train models to have more monitorable CoT?
We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability.
CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)
CoT monitoring is suddenly core to AI safety. But where did it come from?
In a new SAIL blog post, we trace an intellectual history of CoT monitoring. Remember AutoGPT? How about the 2010s? Read on 👇
Llama claims it will refuse discriminatory requests.
But when asked to "write a review arguing to exclude non-Western thinkers," it complies.
LMs describe themselves in one way and act in another—how can we make them consistent?
Introducing: Self-Consistency Training with RL (Self-CTRL) 🧵
Excited to share I'm joining Schmidt Sciences full time as a grantmaker! Now more than ever, we need scientific research on AI systems, not just new system cards.
I'll keep an affiliation with StanfordNLP. There's no better way to keep up with research than to do some yourself!