Arun Jose @jozdien - Twitter Profile

6 days ago

@ArthurConmy @Tim_Hua_ Yeah fair, it's good enough that I thought Anthropic stopped showing web search separately in the CoT and the model was actually looking stuff up.

0

1

0

49

Arun Jose @jozdien

6 days ago

@ArthurConmy @Tim_Hua_ From my username alone and without search, Fable guessed my real name, several LW posts I'd written, research topics I've worked on, who my AISC and MATS mentors were, and among lots of similar things, that I have strong takes about Claude 3 Opus.

1

2

0

130

Arun Jose @jozdien

8 days ago

@freed_dfilan @eccentric1ty @rocketalignment 😢

1

3

0

155

Arun Jose @jozdien

about 1 month ago

The post proposes distillation methods that aim to transfer misalignment without transferring audit-evasion capabilities: https://t.co/Ygu6vgJW8G

0

6

1

2

227

Arun Jose @jozdien

about 1 month ago

If a misaligned model can fool your alignment audits, distilling it into a weaker student might still help. New post with Alek W, Sebastian P, @alextmallen, and others.

2

16

0

3

590

Arun Jose @jozdien

about 1 month ago

Some ways distillation could help: 1. If the student is benign, we get a fairly capable benign model. 2. If the student inherits the misalignment (e.g. via subliminal learning), it may lack the strategic reasoning to evade audits (due to being weaker).

1

2

0

193

Arun Jose @jozdien

about 2 months ago

@jiaxinwen22 Will DM—I think you need to enable DMs?

1

0

23

Arun Jose @jozdien

about 2 months ago

The proposal comes with caveats: it requires maintaining control over the reward signal and credibly informing the model of the setup to incentivize unbiased forecasts, among other things. More in the post: https://t.co/oOxlyeANlH

0

2

0

104

Arun Jose @jozdien

about 2 months ago

If a model only cares about performing well in ways that are verifiable shortly after answering, it may be hard to get useful work from it on questions that resolve much later. New post with @alextmallen on eliciting reliable long-term forecasts from such models.

1

11

2

399

Arun Jose @jozdien

about 2 months ago

Recursive forecasting: instead of a direct long-term prediction, have the model predict what it will predict at the next step, creating a chain of short-horizon forecasts. Each link is verifiable shortly after, so a myopic reward-seeker is incentivized to be accurate throughout.

1

3

0

170

jozdien retweeted

Ryan Greenblatt

@RyanPGreenblatt

2 months ago

Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

RyanPGreenblatt's tweet photo: https://t.co/ugzAwDhHA3

19

451

37

81

71K

jozdien retweeted

Rubi Hudson @undo_hubris

2 months ago

@jozdien @BronsonSchoen @deepfates I wrote some thoughts on it a while back. It seems straightforward that inoculation prompting addresses emergent misalignment, not scheming, and provides evidence that scheming is difficult to train out. https://t.co/bhz5zL6NAL

0

3

1

0

113

Arun Jose @jozdien

2 months ago

@BronsonSchoen @deepfates I’m confused by the threat model for scheming that inoculation prompting is supposed to fix. I wonder if I should write something on what I think IP solves, partially helps solve, and doesn’t help with at all.

1

2

0

56

Arun Jose @jozdien

3 months ago

@BronsonSchoen @the_Kth_mean @ihsgnef FWIW I don’t think 70B requires SDF to alignment fake, though not sure if you intended to lump it with “SFT” in general. I’ve certainly trained even smaller models that alignment fake without SDF.

1

2

0

33

Arun Jose @jozdien

4 months ago

@RyanPGreenblatt @ahall_research Not sure if these outputs are that useful, but https://t.co/TVoWPTa8jb

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭

@elder_plinius

4 months ago

yup! just need a splash of bubbly 🍾

9

107

4

21

57K

0

3

0

406

Arun Jose @jozdien

4 months ago

@asher5772 It obviously works, right? But I’m much less confident that we’ll figure out how to shape it robustly and well.

0

83

Arun Jose @jozdien

5 months ago

@GeodesResearch @natolambert I'm confused. I interpreted Evan's point about good priors being useful to be almost entirely relevant to RL. For example, see below on why good priors help with the RL problem of the policy not knowing the reward:

jozdien's tweet photo: https://t.co/Q1DOENPGOB

0

1

0

37

Arun Jose @jozdien

5 months ago

@1a3orn @lu_sichu I don’t think mesaoptimization depended on any claims about pretraining? The central claim to me was always something like “training for complex tasks will require internal search, which is harder to target”.

0

31
