Asher @asher5772 - Twitter Profile

1 day ago

It would be really nice if coding agents let you automatically 'fork and then remerge' conversations. I want a window displaying 4 Claudes, each of which is running the same session, so I can ask them different queries. And they automatically read each other's outputs & changes.

4

11

0

1

558

Asher @asher5772

1 day ago

@PradyuPrasad Yeah, public to anyone with enough compute

0

2

0

40

Asher @asher5772

2 days ago

This is mostly not because of IP protection, there would be easier ways to deal with that (eg, have a separate checkpoint with additional training on internal data). It's because a) Mythos provides significant uplift, and b) not restricting future models would make RSI public

Dwarkesh Patel

@dwarkesh_sp

3 days ago

Re the Fable ML sandbagging, the model's AI research capabilities were probably at least partly trained on Anthropic employees diffing atop proprietary algos and infra.

So the IP leak is somewhat like a researcher who knows Anthropic's stack getting poached to another lab.

Anthropic's recent "When AI builds itself" post talks about a next-step eval. Where they snapshot a research session at the moment a human researcher made a suboptimal next-step choice, show a model only the transcript up to that point and ask what it would do next, then have a hindsight-equipped LLM judge decide whether the model's suggestion or the human's actual choice was better.

This eval seems like a very good RL target for AI R&D - one among many that could be used to have AIs emulate Anthropic researchers and their research products.

I'm just speculating. But if this was a motivation, then Anthropic should have figured out a better way to protect IP than sandbagging without telling the user they're sandbagging, which is very hostile and untrustworthy behavior.

46

1K

53

378

104K

2

9

0

2

2K

Asher @asher5772

2 days ago

(Making no claims about whether Anthropic's handling of the situation thus far is good or bad)

0

2

0

191

Asher @asher5772

2 days ago

@dwarkesh_sp
> Anthropic should have figured out a better way to protect IP than sandbagging
What's your proposed alternative? It's a tough question. Extremely good classifiers plus outright refusals?

1

2

0

363

Asher @asher5772

6 days ago

@NicholasD91704 permanent ):

1

0

47

Asher @asher5772

7 days ago

@jiaxinwen22 How would the consistency relationships be encoded? In the paper setup, I think structured/coherent human errors would be predictable from other structured errors. My read is that predictability might remove some noisy errors, which would be great fwiw. But curious your take

0

48

Asher @asher5772

11 days ago

@CarlGuo866 congrats!

0

131

Asher @asher5772

15 days ago

@willdepue look at the plots in https://t.co/UHD5WzsQjo

0

1

0

110

Asher @asher5772

19 days ago

@boazbaraktcs I feel like the dedicated second graders will just find a way around it, libertarian paternalism at its finest

0

128

Asher @asher5772

about 1 month ago

@CoreAutoAI @_arohan_ all the canonical architectures, plus residual connections, moe, jepa, layernorm (projection onto the hyperplane orthogonal to 1 and then projection onto the unit sphere), and some attention variants. all still kinda hacks

0

3

0

1

537
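As an aside on the LayerNorm description in the reply above, here is a minimal sketch of that geometric reading. It assumes LayerNorm without the learned scale and shift and with eps = 0; the function names `layernorm` and `project_then_normalize` are made up for illustration. Under those assumptions, centering is exactly the orthogonal projection onto the hyperplane orthogonal to the all-ones vector, and dividing by the standard deviation lands on a sphere of radius sqrt(d), so the result matches the unit-sphere projection up to that constant factor.

```python
import math

def layernorm(x, eps=0.0):
    # Plain LayerNorm over one vector, with no learned scale/shift (assumption).
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased variance, as LayerNorm uses
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def project_then_normalize(x):
    # Step 1: project onto the hyperplane orthogonal to the all-ones vector.
    # Subtracting the mean removes the component along 1, since (x . 1) / (1 . 1) = mean.
    d = len(x)
    mean = sum(x) / d
    centered = [v - mean for v in x]
    # Step 2: project onto the unit sphere by dividing by the Euclidean norm.
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

x = [2.0, -1.0, 0.5, 4.0]   # arbitrary example input
ln = layernorm(x)
geo = project_then_normalize(x)
scale = math.sqrt(len(x))   # LayerNorm output has norm sqrt(d), not 1

# The two agree up to the constant factor sqrt(d).
agree = all(math.isclose(a, scale * b, abs_tol=1e-12) for a, b in zip(ln, geo))
```

With a nonzero eps or a learned affine transform the correspondence is only approximate, which is why the sketch leaves both out.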

Asher @asher5772

about 1 month ago

@DavidTurturean Congratulations David! :)

0

1

0

142

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI Maybe your specific complaint is that KL can make explanations *look* interpretable even when they aren't. Which is in some sense worse than obvious illegibility, though it still counts as illegibility IMO. Anyway I agree with this as well

0

20

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI I don't know what claim I made that this is a counterpoint to. I agree that something analogous to the perceptual vs pixel loss gap could occur in NLAs. I'm kind of confused what we disagree about, best guess is that we're using the word illegible differently

2

0

22

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI Actual Explanation = Contains the Full Semantic Content of the Acts
This is true of any NLA w/ low loss. The q is whether that semantic content is legible, which KL helps with. KL could lead to additional BS structure in the latents, but IMO this can be evaluated + prevented

1

0

30

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI I'm confused how you're defining the boundary between unfaithfulness and illegibility, I consider that to be about illegibility

1

0

31

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI I thought we were talking about illegibility? I don't think the current worry is that NLAs will be unfaithful about the semantic content of the activations (whatever that means, it's a fine line)

1

0

34

Asher @asher5772

about 1 month ago

@StephenLCasper @AnthropicAI I think the default optimization pressure on CoT already selects for illegibility in the limit. CoT legibility also entirely relies on the human language prior, and is in this way analogous to NLA latents

1

3

0

132

Asher @asher5772

about 1 month ago

@usmananwar391 @StephenLCasper @AnthropicAI Yeah, and I think this also optimizes for illegible reasoning. Luckily, we start close enough to the human language prior that we get reasonable explanations anyway. Same goes for NLAs, hopefully, though maybe to a slightly lesser degree.

1

0

35
