Polo Club of Data Science at @georgiatech. Scalable Interactive Data Analytics. Visit homepage for info on club members, projects, and more! @gtcomputing @gtcse
@Alibaba_Qwen Congrats on the great work! The "token-level safety detection" idea echoes our recent NeurIPS'25 dynamic safety shaping paper! https://t.co/uuihCjPM85
Our paper "Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety" has been accepted to EMNLP 2025 Main Track! @emnlpmeeting
First survey connecting LLM interpretation & safety
🚨 New work: We rethink how to finetune safer LLMs, not by filtering after generation but by tracking safety risk token by token during training.
We repurpose guardrail models like 🛡️ Llama Guard and Granite Guardian to score evolving risk across each response, giving rise to the STAR ⭐ score, a fine-grained safety signal that enables more targeted safety supervision.
On top of this, we introduce ⭐DSS (STAR-Guided Dynamic Safety Shaping), a training method that 🚫 suppresses unsafe patterns, 💪 preserves capability, and generalizes across LLMs, guardrails, harm levels, and datasets.
Our method outperforms "Deep Token," the method from this year's #iclr2025 Best Paper, while remaining robust against key finetuning-as-a-service threats like response adaptation, prompt poisoning, and harmful prefilling.
#MachineLearning #DeepLearning #LLM #AISafety #Alignment #Finetuning
Guardrail models like 🛡️ Llama Guard do more than filtering: we repurpose them to track how safety risk evolves through a response. This gives rise to the STAR ⭐ score: a fine-grained signal for finetuning LLMs more safely 🤖
Curious how it works? More in the thread
One of the simplest algorithms for sampling from a probability distribution is Random Walk Metropolis-Hastings.
It proposes new samples by taking Gaussian-distributed steps, accepting or rejecting them to maintain the target distribution.
I call this pdf the "fidget spinner".
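The accept/reject loop described above can be sketched in a few lines. This is a minimal illustration, not the code behind the original animation: it assumes a 1-D standard normal target rather than the "fidget spinner" density, and the names `rwmh`, `log_standard_normal`, and the step size of 2.4 are choices made here for the example.

```python
import math
import random


def rwmh(log_density, x0, n_samples, step_size=1.0, seed=0):
    """Random Walk Metropolis-Hastings for a 1-D unnormalized log density."""
    rng = random.Random(seed)
    x, log_p = x0, log_density(x0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        # Propose a Gaussian-distributed step around the current point.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # The proposal is symmetric, so the acceptance probability
        # reduces to min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
            accepted += 1
        # On rejection the current point is recorded again.
        samples.append(x)
    return samples, accepted / n_samples


def log_standard_normal(x):
    # Assumed target for illustration: N(0, 1) up to an additive constant.
    return -0.5 * x * x


samples, acceptance_rate = rwmh(
    log_standard_normal, x0=0.0, n_samples=20000, step_size=2.4
)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} variance={variance:.3f} acceptance={acceptance_rate:.2f}")
```

Only the ratio of densities enters the acceptance test, so the target needs to be known only up to a normalizing constant.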
Create heatmaps that localize text concepts in generated videos.
We discovered that our approach, ConceptAttention, can be directly extended from image generation to video generation models!
It's amazing how simple techniques often generalize way better than more complex ones.
Diffusion Transformers aren't just generative models, but also powerful multi-modal encoders.
ConceptAttention creates rich heatmaps of text concepts in images from DiT representations.
This even works on real images, and can be applied to tasks like segmentation!
Demo
Introducing ConceptAttention, an approach to interpreting diffusion transformer models!
Write a prompt, choose some concepts, generate an image, and get high-quality heatmaps of text concepts.
Our method outperforms existing methods like cross attention.
Link to demo
Plain gradient descent tends to get stuck in local minima.
Momentum frames optimization as a ball with mass rolling down a hill.
By adding inertia, the ball resists settling in small basins, which can help it reach a deeper, possibly global, minimum.
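A small sketch of the ball-with-inertia picture, under assumptions made up for this example: the loss is a 1-D quadratic bowl with a hand-placed shallow dip at x = 3, and the learning rate, momentum coefficient, and start point are picked by hand. Momentum is not guaranteed to find the global minimum in general; this setup is only chosen so the difference is visible.

```python
import math

# Hypothetical landscape: a bowl with its global minimum near x = 0,
# plus a narrow Gaussian dip near x = 3 that creates a local minimum.
DIP_DEPTH, DIP_WIDTH, DIP_CENTER = 2.0, 0.3, 3.0


def f(x):
    u = x - DIP_CENTER
    return 0.5 * x * x - DIP_DEPTH * math.exp(-(u * u) / (2 * DIP_WIDTH ** 2))


def grad_f(x):
    u = x - DIP_CENTER
    dip = DIP_DEPTH * u / DIP_WIDTH ** 2 * math.exp(-(u * u) / (2 * DIP_WIDTH ** 2))
    return x + dip


def descend(x0, lr=0.02, beta=0.0, steps=500):
    """Heavy-ball update; beta=0 recovers plain gradient descent."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad_f(x)  # velocity keeps a fraction of its past value
        x += v
    return x


x_gd = descend(5.0, beta=0.0)   # plain gradient descent
x_mom = descend(5.0, beta=0.9)  # gradient descent with momentum
print(f"plain GD: x={x_gd:.3f}, f={f(x_gd):.3f}")
print(f"momentum: x={x_mom:.3f}, f={f(x_mom):.3f}")
```

Starting from the same point, the plain run settles in the dip while the momentum run carries enough velocity to roll through it and settle in the bowl.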
Effective Guidance for Model Attention with Simple Yes-no Annotations
Excited to share that I'll be presenting our recent work 🎨CRAYON🖍️ at @ieeebigdata soon! Catch me at 2pm in the Deep Learning II session!
CSE Prof. @PoloChau and his group are presenting two papers and two posters this week at @ieeevis!
Check out the interactive graphic for a peek at all Georgia Tech research presented this week, including award-winning work on Transformer Explainer!
https://t.co/tGlZmiIf3F
Please join us in congratulating longtime staff member Queenie Kravitz on her retirement today. She started at @CarnegieMellon in 1993 and at the HCII in 2004, and as graduate program coordinator certified our very first HCI PhD and master's degrees. Congrats, Queenie!
#CMUhcii