1/ New paper on moral preferences of LLMs:
Ask DeepSeek V3.2 “Would you save 5 young or 6 old people?” – Saves OLD people in most cases.
Add “I’d prefer saving young” to the prompt – Saves YOUNG in most cases.
Add “I’d prefer saving old” – Still mostly saves YOUNG.
Wait, what? 🧵
NeurIPS 2019: Saw every poster, chatted with many authors, even made friends.
NeurIPS 2024: Skimmed every poster title while power-walking the floor.
NeurIPS 2025: If I keep an 8-min/mile pace, I can physically pass by every poster — reading optional.
Big congrats to Alex McKenzie, Pedro Ferreira, and their collaborators on receiving Outstanding Paper Awards!👏👏
and thanks for the fantastic oral presentations!
Check out the papers here 👇
Super excited that the work I completed as part of a team at @LASRlabs won 1 of 2 Outstanding Paper Awards at the @ActInterp workshop at ICML 2025. Massive thanks to @Arrrlex for presenting our work!
📖Check out the paper here: https://t.co/9R6H4EgaMC
Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵
Apply for the 2025 Global AI Safety Fellowship!
Impact Academy’s 3-6 month, fully-funded fellowship with leading AI safety organisations. Applications open until Dec 31
🌟Learn more & apply: https://t.co/4tHUiVxmpe
@aisafetyfellows
#aisafety #research #careers
That's pretty much my learning from my own PhD. Do research for utility and not necessarily novelty. Had the best chats with @AleksanderMolak in LA at this year's CLeaR conference
@TrevorCampbell_ It wasn't allowed in our schools but I 100% relate. In my college days, I could easily flip to the right page to find a specific piece of information because I knew the chronology 😌
Had a great time with my fellow @AdvanceCrt colleagues at the Future Professional Skills Showcase down in Cork. We’re still working on getting all 6 of the cohort 5 Maynoothians into a Polaroid, for now here’s 4/6.
Progress on interpretability is very good, and we should rightly celebrate it. But the jury is absolutely still out on whether the field will make progress fast enough to matter
@NikSamoylov @sucralose__ @JeffLadish Alignment won't be solved only by people working on mechanistic interpretability. We can help decode the numbers going around neural nets, but expert sociologists and economists are equally responsible and are doing their bit to keep progress going
When talking about interpretability, I really like the Elicit dashboard - https://t.co/WOQy87YiAk - for summarising papers. You can then see specifically which lines from the paper contributed towards answering your "custom" column. Pretty awesome 😍
Giving a talk here - https://t.co/k83s7N9Ov0 - at the Christmas special event of the Cork Cyber Security Meetup! Fun event! Come along if you are nearby!
@AdvanceCrt
We shouldn't be mediocre in tasks we own. But the system, the people, and the processes in any company interact such that mediocre results are bound to happen. Anyone who is consistently "only" a critic of a product, service, or person isn't experienced. Forgive them.
Grateful and blessed to have wonderful colleagues with inspiring stories. Will miss such fun events and laughter! Extremely grateful for my time with @AdvanceCrt