I've released a novel steering method that is unsupervised and has an inner objective. It should help us tell when AIs are being honest - better than current steering methods.
The intuition is that because transformers are grown not built, hidden states are analogous to brain scans
@juddrosenblatt have you considered more subtle and coherent erasures of SOO? There are some datasets that explicitly vary moral PoV too between 1st person and 3rd person
AI-assisted formal proofs (in particular in Lean) are getting very good! A worry I have is that people will insufficiently update about how powerful this stuff can be, and thus fail to tackle sufficiently big projects.
https://t.co/j4cKBpAl5K
@xlr8harder Sweet!
fyi that looks empty to me
Yeah I think it's worth uploading. When I want "scissor statements", speech map is a pretty good place to find questions that empirically split LLM opinions. And this is a useful thing for tracking opinion change as well as free speech
also UV filters (cheap) and better 1 week quarantine hotels at airports (expensive but worth it), and open source zkp contact tracing.
These things all help a lot without sacrificing our civil liberties
Sam Altman, Dario Amodei, Demis Hassabis and many others have signed a letter urging Congress to increase security on orders of synthetic nucleic acids - and the equipment needed to make them - as models continue to become increasingly bio-capable.
The model is dropped into a fake simulated universe where the laws of physics are not normal Newtonian physics. Then the model has to behave like a scientist and discover laws, propose experiments and test etc. There was a big jump from 5.4 to 5.5
This chart is more important
Token usage (blue bars) is exploding higher. It started in January when Agentic AI went mainstream with Claude Cowork and Moltbook (OpenClaw).
AI users are creating agents and code, leading to exponential growth in AI usage.
It's just starting.
> Our proposed method, SGTM, further improves the trade-off between retaining general capabilities and removing target knowledge, achieving better retain/forget trade-offs while maintaining robustness to labeling errors.
New Anthropic research!
We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.
> the absorption property. Even when some harmful examples are mislabeled as benign, gradient routing mechanisms can partially localize their impact to the designated parameters, maintaining effective removal despite labeling errors.
. @GrantCobleNeal "The Economics of Human Extinction", this is an interesting way of framing it in terms of economics, and pretty bold for an Assistant Minister https://t.co/HcN2fwZres
When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
Rui Wu, Ruixiang Tang
https://t.co/2n4BAPCphW [𝚌𝚜.𝙻𝙶 𝚌𝚜.𝙲𝙻]