@rosmine we found it works surprisingly well for continual learning too!
for T sequential tasks, tuning -> merging -> reinit beats fine-tuning the same LoRA
https://t.co/sltuEUjlm4
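A rough sketch of the tune -> merge -> reinit loop for T sequential tasks, in plain Python. Everything here is an assumption made for illustration: `tune_fn` is a hypothetical stand-in for whatever LoRA training you run with the base weights frozen, and the zero-initialised B is the usual LoRA init rather than a detail confirmed by the linked work.

```python
import random

def matmul(X, Y):
    # Plain-Python matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def init_lora(d_out, d_in, rank, rng):
    # Common LoRA init (assumed): A small random, B zero, so B @ A starts at 0.
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def merge(W, A, B):
    # Fold the low-rank update into the base weights: W <- W + B @ A.
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def tune_merge_reinit(W, tasks, tune_fn, rank=2, seed=0):
    # tune_fn(W, A, B, task) -> (A, B) is a hypothetical training routine
    # that updates only the adapter while W stays frozen.
    rng = random.Random(seed)
    d_out, d_in = len(W), len(W[0])
    for task in tasks:
        A, B = init_lora(d_out, d_in, rank, rng)  # reinit: fresh adapter per task
        A, B = tune_fn(W, A, B, task)             # tune on the current task
        W = merge(W, A, B)                        # merge before the next task
    return W
```

The baseline in the comparison would instead keep one (A, B) pair and keep training it across all T tasks without merging.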
Can we catch an AI hiding information from us?
To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them, and many worked!
We release our models so you can test your own techniques too!
New paper: Deceptive LLMs may keep secrets from their operators. Can we elicit this latent knowledge? Maybe!
Our LLM knows a secret word, which we extract with mech interp & black-box baselines. We open-source our model: how much better can you do?
w/ @emilaryd @sen_r @NeelNanda5
🔥 New ICLR 2025 Paper!
It would be cool to control the content of text generated by diffusion models with less than 1% of parameters, right?
And how about doing it across diverse architectures and within various applications? 🚀
🫡 Together with @lukxst, we show how:
🧵 1/
🔥 New Paper!
How can sparse autoencoders (SAEs) applied to diffusion models help us solve real-world challenges?
🚀 Introducing 𝗦𝗔𝗲𝗨𝗿𝗼𝗻: We use SAEs for unlearning in diffusion models and outperform existing baselines!
Here's how it works:
🧵 1/
@fffiloni Transformer models like SD3/Flux will need something more advanced to find the style- and object-influenced layers. For SDXL, we had to inject an alternate prompt into one cross-attention (c-a) layer/block. Here, activation patching may come in handy.
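For anyone unfamiliar with activation patching, here is a toy sketch of the idea, assuming a residual model whose blocks are plain Python functions on scalars. This is a generic illustration, not the SDXL or SD3/Flux setup: `run` and `patching_effects` are hypothetical helper names, and a real model would cache and swap tensors through framework hooks.

```python
def run(blocks, x, patch=None):
    # Residual forward pass: h <- h + block(h).
    # `patch` maps a block index to an output that replaces that block's own.
    patch = patch or {}
    h, outs = x, []
    for i, block in enumerate(blocks):
        out = patch[i] if i in patch else block(h)
        outs.append(out)
        h = h + out
    return h, outs

def patching_effects(blocks, x_base, x_alt):
    # For each block, splice in its output from the alternate run and
    # measure how far the final output moves away from the base run.
    base_out, _ = run(blocks, x_base)
    _, alt_outs = run(blocks, x_alt)
    effects = []
    for i in range(len(blocks)):
        patched_out, _ = run(blocks, x_base, patch={i: alt_outs[i]})
        effects.append(abs(patched_out - base_out))
    return effects
```

Blocks with a large effect are the ones that carry the difference between the two inputs, which is the kind of signal you would use to pick layers for prompt injection.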