Connor Dilgren

@ConnorDilgren

CS MS student @umdcs

Joined February 2014

158 Following

90 Followers

9 Posts

Connor Dilgren @ConnorDilgren

about 2 months ago

@frisbeemortel @mariusmosbach Wow nice work Michael! Totally agree with your conclusion that fine-tuned models don't seem to use latent reasoning (let alone superposition!). And it's great you found that from-scratch models can learn it. Happy to chat!

Connor Dilgren @ConnorDilgren

about 2 months ago

Excited to announce my first preprint in LM interpretability! Latent reasoning models are not monitorable by default, since they don't reason in human-readable, natural language text. But can we make progress in understanding their intermediate reasoning steps using mech interp? https://t.co/P9v3jPT45N

Connor Dilgren @ConnorDilgren

about 2 months ago

Thanks to @sarahwiegreffe for advising this project! Read the preprint here: https://t.co/1b8Cu1TUjj

Connor Dilgren @ConnorDilgren

about 2 months ago

Overall, these results are somewhat encouraging for latent reasoning model interpretability. But I suspect models with weaker natural language priors, such as those trained to do latent reasoning during pretraining or through RL, will be much less interpretable.
