It would be really nice if coding agents let you automatically 'fork and then remerge' conversations. I want a window displaying 4 Claudes, each of which is running the same session, so I can ask them different queries. And they automatically read each other's outputs & changes.
This is mostly not because of IP protection; there would be easier ways to deal with that (eg, have a separate checkpoint with additional training on internal data).
It's because a) Mythos provides significant uplift, and b) not restricting future models would make RSI public
Re the Fable ML sandbagging, the model's AI research capabilities were probably at least partly trained on Anthropic employees diffing atop proprietary algos and infra.
So the IP leak is somewhat like a researcher who knows Anthropic's stack getting poached to another lab.
Anthropic's recent "When AI builds itself" post talks about a next-step eval, where they snapshot a research session at the moment a human researcher made a suboptimal next-step choice, show a model only the transcript up to that point and ask what it would do next, then have a hindsight-equipped LLM judge decide whether the model's suggestion or the human's actual choice was better.
This eval seems like a very good RL target for AI R&D - one among many that could be used to have AIs emulate Anthropic researchers and their research products.
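To make the shape of that eval concrete, here's a rough sketch of the loop as I read the description. Every name here (`Snapshot`, `propose`, `judge`, `next_step_eval`) is hypothetical and made up for illustration; I don't know how Anthropic actually implements it. The win rate it returns is the kind of scalar you could imagine using as a reward.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snapshot:
    """One decision point in a research session (hypothetical structure)."""
    transcript_prefix: str  # session transcript up to the decision point
    human_choice: str       # what the human researcher actually did next
    hindsight: str          # what happened afterwards; only the judge sees this


def next_step_eval(
    snapshots: List[Snapshot],
    propose: Callable[[str], str],
    judge: Callable[[str, str, str, str], str],
) -> float:
    """Return the fraction of snapshots where the judge prefers the model.

    `propose` stands in for the model under test: it sees only the prefix.
    `judge` stands in for the hindsight-equipped LLM judge: it returns
    "model" or "human". Both are assumptions, not a real API.
    """
    if not snapshots:
        return 0.0
    wins = 0
    for snap in snapshots:
        suggestion = propose(snap.transcript_prefix)
        verdict = judge(
            snap.transcript_prefix, suggestion, snap.human_choice, snap.hindsight
        )
        if verdict == "model":
            wins += 1
    return wins / len(snapshots)
```

The key design point is the information asymmetry: the model only gets `transcript_prefix`, while the judge also gets `hindsight`.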
I'm just speculating. But if this was a motivation, then Anthropic should have figured out a better way to protect IP than sandbagging without telling the user they're sandbagging, which is very hostile and untrustworthy behavior.
@dwarkesh_sp > Anthropic should have figured out a better way to protect IP than sandbagging
What's your proposed alternative? It's a tough question.
Extremely good classifiers plus outright refusals?
@jiaxinwen22 How would the consistency relationships be encoded? In the paper setup, I think structured/coherent human errors would be predictable from other structured errors. My read is that predictability might remove some noisy errors, which would be great fwiw. But curious your take
@CoreAutoAI@_arohan_ all the canonical architectures, plus residual connections, moe, jepa, layernorm (projection onto the hyperplane orthogonal to 1 and then projection onto the unit sphere), and some attention variants. all still kinda hacks
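A minimal sketch of the layernorm claim, in plain Python, assuming no epsilon and no learned gain/bias (real implementations have both). One caveat: the second projection lands on the sphere of radius sqrt(d) rather than the unit sphere, so the two views agree up to that constant factor. Function names are mine, just for illustration.

```python
import math


def layernorm(x):
    """Standard layernorm without eps or affine parameters."""
    d = len(x)
    mean = sum(x) / d
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / d  # biased variance, as in layernorm
    return [c / math.sqrt(var) for c in centered]


def project_then_normalize(x):
    """Same map written as two projections.

    1. Project onto the hyperplane orthogonal to the all-ones vector.
    2. Project onto the unit sphere, then rescale by sqrt(d).
    """
    d = len(x)
    coeff = sum(x) / d  # <x, 1> / <1, 1>
    centered = [v - coeff for v in x]  # x minus its component along 1
    norm = math.sqrt(sum(c * c for c in centered))
    unit = [c / norm for c in centered]  # point on the unit sphere
    return [math.sqrt(d) * u for u in unit]
```

Both functions fail on a constant input (zero variance), which is exactly the case the epsilon in real layernorm handles.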
@usmananwar391@StephenLCasper@AnthropicAI Maybe your specific complaint is that KL can make explanations *look* interpretable even when they aren't. Which is in some sense worse than obvious illegibility, though it still counts as illegibility IMO. Anyway I agree with this as well
@usmananwar391@StephenLCasper@AnthropicAI I don't know what claim I made that this is a counterpoint to. I agree that something analogous to the perceptual vs pixel loss gap could occur in NLAs. I'm kind of confused what we disagree about; my best guess is that we're using the word illegible differently
@usmananwar391@StephenLCasper@AnthropicAI Actual Explanation = Contains the Full Semantic Content of the Acts
This is true of any NLA w/ low loss. The q is whether that semantic content is legible, which KL helps with.
KL could lead to additional BS structure in the latents, but IMO this can be evaluated + prevented
@usmananwar391@StephenLCasper@AnthropicAI I'm confused how you're defining the boundary between unfaithfulness and illegibility; I consider that to be about illegibility
@usmananwar391@StephenLCasper@AnthropicAI I thought we were talking about illegibility? I don't think the current worry is that NLAs will be unfaithful about the semantic content of the activations (whatever that means, it's a fine line)
@StephenLCasper@AnthropicAI I think the default optimization pressure on CoT already selects for illegibility in the limit. CoT legibility also entirely relies on the human language prior, and is in this way analogous to NLA latents
@usmananwar391@StephenLCasper@AnthropicAI Yeah, and I think this also optimizes for illegible reasoning. Luckily, we start close enough to the human language prior that we get reasonable explanations anyway. Same goes for NLAs, hopefully, though maybe to a slightly lesser degree.