1. Obviously terrible to have a Molotov thrown against your house, not an appropriate response
2. Of all analogies to make, "ring of power" is a choice, given the story's theme that the only way to stop the ring's destructive power is to destroy it.
https://t.co/dpdDLBlvrS
There is now another amicus brief filed by a number of former high-ranking military officials (up to Admiral level), arguing these actions hurt the military's adherence to the rule of law.
https://t.co/rgCZz0WU0T
You know, when I switched into safety, I was a little worried it was too early.
Between the decline of coding by hand, OpenClaw YOLOing, increasingly eval aware models, and DoD pressure to let AI be used for surveillance and autonomous weapons
yeah
It wasn't early
I didn't know where this post was going when I started and I'm not sure where it went now that it ended, but that felt correct in some way.
https://t.co/3ygcvSdN5t
@jameschua_sg @Turn_Trout @red_bayes @davidelson @rohinmshah In this we didn't look at any CoT scenarios. In general, it's tricky... personally I think SFT style methods are okay for CoT if you've checked your responses are consistent with your CoT beforehand, based on the OpenAI deliberative alignment work.
@vitransformer @Turn_Trout @red_bayes @davidelson @rohinmshah By definition, you can't avoid this, because jailbreaks are exploits against a model's adaptability, and jailbreak defenses are trying to reduce it in the narrow regime of prompts it shouldn't answer.
As for how well it stays within the narrow regime, so far similar to baseline
First paper since switching into AI safety team🎉
We looked at problems that could be solved if the model behaved consistently over a set of prompts, and tried training for that consistency in output space and in internal activations. Both were effective. See thread or paper for details.
> switch to AI safety
> no safety papers to cite in reviewer profile
> only get assigned robotics papers
Apologies in advance as I try to crash course the past year in a few weeks...
A simple AGI safety technique: AI’s thoughts are in plain English, just read them
We know it works, with OK (not perfect) transparency!
The risk is fragility: RL training, new architectures, etc. threaten transparency
Experts from many orgs agree we should try to preserve it: 🧵
AI numbers guide
ElevenLabs: AI voice generation startup
TwelveLabs: AI video understanding startup
ThirteenAI: parked domain for AI agency startup
14ai: AI agent startup
https://t.co/hqOBhAMcOR: non-commercial My Little Pony voice generation
One is more based than the rest.
"I don't play gacha games because they're a scam"
vs
"Let me do one more hyperparam sweep before giving up. One more prompt tuning run. I swear we'll beat baseline. I know it's gonna beat the baseline this time. It's gonna win. This time for sure."
Q: How can we ensure robots behave properly at scale? A: Robot constitutions 📜!
Q: How do we verify behavior in undesirable situations at scale? A: Generation!
We release the ASIMOV Benchmark for Semantic Safety of robots at https://t.co/lY1Mn8B8pV
@GoogleDeepMind
We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.