New paper:
Can you prevent emergent misalignment with inoculation prompting, or by diluting bad data with good?
Prior work suggests you can. We show the misalignment is still present but hidden, and that it is triggered by adding cues to prompts that evoke the bad data.
Before the limited release of Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
@PrimeIntellect I built an RL env to test reasoning under constraints. The agent moves on a grid, has to pick up and deliver a package, and must manage a finite battery by recharging at a charger tile. This was fun to put together. Env:
https://t.co/ur92BKJRy5
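A minimal sketch of the kind of environment described above, not the author's implementation. The class name `DeliveryGrid`, the action names, the 5x5 layout, the battery size, and the sparse reward are all assumptions made for illustration; the linked env may differ on every one of them.

```python
from dataclasses import dataclass

# Hypothetical action names; the linked environment may use different ones.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class DeliveryGrid:
    # Assumed layout and battery size, chosen so that a direct delivery
    # (8 moves) is impossible without a stop at the charger.
    width: int = 5
    height: int = 5
    start: tuple = (0, 0)
    pickup: tuple = (2, 0)
    dropoff: tuple = (4, 4)
    charger: tuple = (2, 2)
    max_battery: int = 6

    def __post_init__(self):
        self.reset()

    def reset(self):
        self.pos = self.start
        self.battery = self.max_battery
        self.carrying = False
        self.delivered = False
        return self.observe()

    def observe(self):
        return {
            "pos": self.pos,
            "battery": self.battery,
            "carrying": self.carrying,
            "delivered": self.delivered,
        }

    def step(self, action):
        """Apply one move and return (observation, reward, done)."""
        if self.delivered or self.battery <= 0:
            raise RuntimeError("episode is over; call reset()")
        if action not in MOVES:
            raise ValueError(f"unknown action: {action!r}")

        dx, dy = MOVES[action]
        # Clamp to the grid: walking into a wall leaves the agent in place.
        x = min(max(self.pos[0] + dx, 0), self.width - 1)
        y = min(max(self.pos[1] + dy, 0), self.height - 1)
        self.pos = (x, y)
        self.battery -= 1  # every move costs one unit, even a blocked one

        if self.pos == self.charger:
            self.battery = self.max_battery  # full recharge on the charger tile
        if self.pos == self.pickup and not self.carrying:
            self.carrying = True
        if self.pos == self.dropoff and self.carrying:
            self.carrying = False
            self.delivered = True

        done = self.delivered or self.battery <= 0
        reward = 1.0 if self.delivered else 0.0  # assumed sparse reward
        return self.observe(), reward, done
```

With these assumed defaults, the agent has to detour through the charger at (2, 2) to finish the delivery. Heading straight for the dropoff drains the battery first, which is the constraint the post describes.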
The @encodeclub 🫶 @Polkadot 2023 Accelerator has concluded and we couldn't be prouder of the teams! 👏🫡
Thanks to all the guest speakers who made this journey even more special 💕
Full summary here: https://t.co/Iv5CX4x8eV
Or keep reading this thread! 👇
@hyejeebae @buildspace @_nightsweekends Super cool!! Love the creatives as well. Are you going to make a YouTube channel too, to upload the TikToks as Shorts?