Ben Levinstein @ben_levinstein - Twitter Profile

Ben Levinstein @ben_levinstein

3 months ago

Logging onto twitter for the first time in quite a while, only to say: We're on the right side of this dispute.

Anthropic

@AnthropicAI

3 months ago

A statement on the comments from Secretary of War Pete Hegseth. https://t.co/Gg7Zb09IMR

3K

42K

7K

5K

18M

3

223

8

4

5K

Ben Levinstein @ben_levinstein

over 1 year ago

@ejjiott @RYChappell I don't know a ton about this and don't have a reading reference. But fine-tuning won't do nearly as good of a job as putting the papers in the context window. There's also the option of hooking the LLM up to an API to search and retrieve papers/use a RAG.

0

3

0

32

ben_levinstein retweeted

Sean Carroll

@seanmcarroll

over 1 year ago

Mindscape 301 | Tina Eliassi-Rad on AI, Networks, and Epistemic Instability. If we're all just vectors in a huge dataset, might as well turn it to our advantage. #MindscapePodcast https://t.co/pVGa1RLkp2 https://t.co/U6Y0qDV5qC

8

63

12

10

9K

ben_levinstein retweeted

Joe Carlsmith

@jkcarlsmith

over 1 year ago

My current take on Apollo's recent scheming paper is that they aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6 (screenshot of the key numbers below). In more particular: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude Sonnet-3.5 doing notably worse (16%) on a multiplication task/evaluation when the documentation says that Claude will get “unlearning training” if it performs too well on the eval, vs. if it performs too poorly, *even without any other goal listed in the prompt*, and *even without a chain of thought.* (Fuller description in thread.) Taken at face value (it's just one result, more investigation needed, etc), these look like “propensity” results to me. And they suggest that the reasoning that drives this propensity can take place within opaque forward passes of the model rather than in a way that humans can directly monitor, which is a more worrying version. Ofc, these are preliminary results, and it can make sense for various reasons to emphasize other bits. But the sandbagging-a-capability-eval-with-no-goal-prompting-and-no-CoT is the bit that’s standing out to me most.

7

237

43

102

31K

Ben Levinstein @ben_levinstein

over 1 year ago

@Miles_Brundage I think most people I meet, just like around town or whatever, are better at philosophy than Plato.

1

3

0

295

Ben Levinstein @ben_levinstein

over 1 year ago

@AmandaAskell Stand its ground more. I've been using API calls getting Claude to talk to itself and critique its own ideas, but it gets pushed around. Better prompting from me would help, but I'd like it to not be such a challenge. I do like when it slips and admits its capabilities though. https://t.co/Yqxj1UZdSI

2

5

0

277

ben_levinstein retweeted

tanya @Tanya_Sabrinaaa

over 1 year ago

i actually read the Odyssey in the original greek. complete waste of time, i have no idea what those symbols mean

180

142K

7K

2K

11M

ben_levinstein retweeted

Daniel Litt

@littmath

over 1 year ago

A couple more brief thoughts on o3’s (incredible) performance on FrontierMath.

10

627

57

256

170K

ben_levinstein retweeted

Miles Brundage

@Miles_Brundage

over 1 year ago

I'm old enough to remember when getting double digit scores on FrontierMath was considered super hard
I'm 6 weeks old

4

779

47

37

38K

ben_levinstein retweeted

zak miller @zjmiller

over 1 year ago

very excited about this explainer on AI self-awareness; one of the most important AI capabilities to keep tabs on imo

1

11

1

3

806

ben_levinstein retweeted

Anthropic

@AnthropicAI

over 1 year ago

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences. https://t.co/nXjXrahBru

211

4K

690

2K

2M

Ben Levinstein @ben_levinstein

over 1 year ago

Ahahahahaha. What a dumbass.

0

2

0

205

Ben Levinstein @ben_levinstein

over 1 year ago

@_ivyzhang I think so, at least for the tasks I'm usually interested in.

0

1

0

52

Ben Levinstein @ben_levinstein

over 1 year ago

GPT-4o seems so dumb and useless these days compared to Claude. Claude tells me to STFU multiple times a day, which stops lots of my work and hurts my feelings. I've tried switching over to GPT, but it's not the same. Do people still use 4o much for work- or coding-related tasks?

2

5

0

2

562

ben_levinstein retweeted