@frisbeemortel @mariusmosbach Wow, nice work Michael! Totally agree with your conclusion that fine-tuned models don't seem to use latent reasoning (let alone superposition!). And it's great you found that from-scratch models can learn it. Happy to chat!
Excited to announce my first preprint in LM interpretability!
Latent reasoning models are not monitorable by default, since they don't reason in human-readable, natural language text. But can we make progress in understanding their intermediate reasoning steps using mech interp?
Overall, these results are somewhat encouraging for latent reasoning model interpretability. But I suspect models with weaker natural language priors, such as those trained to do latent reasoning during pretraining or through RL, will be much less interpretable.