🦹💥 How to detect if my LLM was stolen or leaked? 🤖💥
I am delighted to announce TRAP 🪤, our new #ACL2024 findings paper ☝️ We showcase how to use adversarial prompt as model fingerprint for LLM.
A thread 🧵
⬇️⬇️⬇️
🎉 Our DISCO paper received a Best Paper Award at the ICLR 2026 “Catch, Adapt, and Operate” Workshop!
Huge thanks to my coauthors Benjamin, @framart1, and @coallaoh, and to the CAO organizers, led by @sepidshs, and @RBCBorealis, for supporting the workshop!
Come check out our poster to learn a simple (but very effective) trick to speed up your LLM evaluation by 100×!
⏰ 15:15 to 17:45 today!
📍Pavilion 3 P3-#1015
1/ My contract at @parameterlab ended last week, after 2.5 years (since Sept 2023, with some collaboration before).
I had the chance to lead research on trustworthy AI for LLMs alongside an incredible group of people
(Neckarfront. All Tübingen researchers have to post it once!)
@CFGeek There is work on that: you can check model fingerprinting and model watermarking (black & white box methods). I recommend this survey & benchmark paper: https://t.co/xmxcG6zngs
Which includes our TRAP paper: https://t.co/IO1rIGrBX5
Nonetheless, I wish we have such model lineage
🎉Our privacy collapse paper has been accepted at #ACL 2026 (main)!
Contextual privacy is fragile: fine-tune an LLM on benign data, and it can overshare personal information.
This is silent: suites don't usually measure contextual privacy, which is a big issue with LLM agents.
🚨 Fine-tuning your model to be more helpful or empathetic might be making it less private, without you noticing.
In our latest work, we show that benign fine-tuning can silently break contextual privacy in language models while safety & general capabilities appear intact.
⬇️
🌍We've made LLM watermarking equally robust across all languages we studied. Scaling to 100+ languages!
Even sota watermarks can be removed by translating to another language, eg. Tamil. This hits hardest in low-resource languages, where moderation tools are already weak.
🧵