[1/6]
🚨 New paper!
Why do dictionary-based explanations fail under distribution shift?
We identify a geometric cause and propose Geometry-Adaptive Explainer (GAE), a training-free method that restores explanation faithfulness.
🌐 https://t.co/lyHAbz3KpX
Details below 🧵
🎉 Update: BHyT has been accepted to ICML 2026
Happy to share that our previously archived paper has been accepted to ICML 2026.
I’m also grateful to have been selected as a Gold Reviewer for ICML 2026.
Sincere thanks again to my co-authors @choiyj9803, @Gold_Milkyway, especially Dr. Sungrae Park from @upstageai, and to my advisor, Prof. @KyungwooSong at Yonsei University, for their support and guidance.
#icml2026 #upstageai #yonsei
🚀 Excited to share our new work: BHyT - a stable & efficient alternative to Pre-LayerNorm for LLMs
📜 https://t.co/m8l1JWcYjf
Pre-LN (e.g., RMSNorm) is stable, but less efficient and suffers from the curse of depth.
Normalization-free methods (e.g., DyT) aim to remove normalization overhead, but do not directly control depth-wise variance growth.
BHyT is a drop-in replacement that keeps activations bounded (non-saturating) + reduces norm overhead.
BHyT vs. RMSNorm
✅ 15.8% faster training
✅ 4.2% higher generation throughput
✅ Matches or improves downstream performance & robustness
🧵Details below:
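For readers less familiar with the two baselines named above, here is a minimal sketch of RMSNorm and DyT on a single token vector. It is not BHyT, whose formula is not given in this post. The input scale, `alpha=0.5`, and the 0.99 saturation threshold are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over one token vector: x / sqrt(mean(x^2) + eps), rescaled by gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def dyt(x, alpha, gain, bias):
    """Dynamic Tanh: element-wise gain * tanh(alpha * x) + bias, with no reduction."""
    return [g * math.tanh(alpha * v) + b for g, v, b in zip(gain, x, bias)]

random.seed(0)
d = 64
x = [random.gauss(0.0, 50.0) for _ in range(d)]  # deliberately large activations (assumed scale)
gain = [1.0] * d
bias = [0.0] * d

y_rms = rms_norm(x, gain)
y_dyt = dyt(x, alpha=0.5, gain=gain, bias=bias)

# RMSNorm needs a reduction over the feature axis but pins the output scale.
rms_out = math.sqrt(sum(v * v for v in y_rms) / d)
# DyT is reduction-free and bounded, but large inputs push tanh toward +/-1.
saturated = sum(abs(v) > 0.99 for v in y_dyt) / d

print(f"RMS of RMSNorm output: {rms_out:.3f}")
print(f"Fraction of DyT outputs with |y| > 0.99: {saturated:.2f}")
```

The contrast is the trade-off the post describes: RMSNorm pays for a per-token reduction, while an element-wise tanh avoids it but can saturate when activations grow.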
6/ Takeaway
✅ BHyT is stability-aware bounding + efficiency via variance approximation.
A practical path to train deeper LLMs with less normalization overhead.
🙏 Big thanks to co-authors @choiyj9803, @Gold_Milkyway, Sungrae Park, and my advisor @KyungwooSong
[4 × ICLR2026] Four papers have been accepted to #ICLR2026
I’m pleased to share that four papers I contributed to were accepted. These works are all first-authored by graduate students in our lab!
Across these four papers, we develop methods that make ML and LLMs more reliable under real-world uncertainty, distribution shift, spurious correlations, and limited supervision. Each project pairs practical algorithms with principled theory to improve robustness, calibration, and safety.
1. Multi-LLM Adaptive Conformal Inference for Reliable LLM Response
2. Uncertainty-driven Embedding Convolution
3. Semi-Supervised Preference Optimization with Limited Feedback
4. Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness
📢 New paper alert! [🧵1/7]
MIDUS: Memory-Infused Depth Up-Scaling [https://t.co/9acesKzLkj]
💡Up-scaling LLM depth without relying on heavy FFNs!
😋 Key idea: swap the FFN modules for our new sparse memory layer, HML.
- Head-wise Memory Layer (HML) does attention head-wise top-k retrieval and writes useful information back into the hidden states.
- In depth up-scaling / CPT evaluations, we see lower perplexity + higher downstream accuracy on 1B/8B, while staying lightweight and high-throughput.
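A rough sketch of what head-wise top-k retrieval with a write-back could look like. This is a generic sparse key-value memory under stated assumptions, not the HML implementation from the paper: the function names, the dot-product scoring, the softmax over selected slots, and the residual write-back are all hypothetical choices, and learned projections are omitted.

```python
import math

def topk_memory_read(query, keys, values, k):
    """Read from a key-value memory using only the k best-matching slots."""
    scores = [sum(q * c for q, c in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected slots (shifted by the max for stability).
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w / z * values[i][j] for w, i in zip(weights, top)) for j in range(dim)]

def headwise_memory_layer(hidden, memories, k):
    """Split hidden into heads, read one memory per head, add the result back."""
    n_heads = len(memories)
    head_dim = len(hidden) // n_heads
    read = []
    for h, (keys, values) in enumerate(memories):
        query = hidden[h * head_dim:(h + 1) * head_dim]
        read.extend(topk_memory_read(query, keys, values, k))
    # Residual write-back into the hidden state.
    return [a + b for a, b in zip(hidden, read)]

# Two heads of width 2, each with its own 2-slot memory (toy values).
memories = [
    ([[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]]),
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]]),
]
hidden = [5.0, 0.0, 0.0, 3.0]
updated = headwise_memory_layer(hidden, memories, k=1)
print(updated)
```

Because each head touches only k slots, the cost per token depends on k rather than on the full memory size, which is the usual motivation for this kind of sparse lookup.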
I'll be @NeurIPSConf to present our paper: CCL: Causal-aware In-context Learning for Out-of-Distribution Generalization (https://t.co/6YwAD1ydy2)
TL;DR: CCL is a VAE-based causal representation learning framework that captures a query’s underlying problem intent and selects intent-aligned examples, making in-context learning more robust in OOD settings.
Feel free to stop by anytime - I’d love to chat about In-context Learning, Causal Representation Learning, or anything related!
📍 Poster #3819 Dec 3 (Wed), 16:30 - 17:30 PST
Our paper "Towards Calibrated Robust Fine-Tuning of Vision-Language Models" was accepted to #NeurIPS2024🥳
📃:https://t.co/cjxtggibaO
[1/n] To pursue uncertainty calibration as well as generalization under distribution shifts, we derived a novel theorem with a practical implementation!
Thanks for sharing your thoughts, Amit. To add some context: my comment regarding @ylecun & @yudapearl's posts is neither about generative nor about deep learning versus causal; those are settled issues in the literature. In other words, we now have some principled understanding of how these modes of reasoning relate. Also, I haven’t made any claim about LLMs & Causality, at least not in this thread.
Putting it simply, my message was triggered by LeCun’s original tweet showing an architecture that looked like what folks in RL have been doing. Since I have been studying RL for a long time and know that it’s insufficient for causal reasoning, in a broad sense (as elaborated here: https://t.co/CV6z6KMzZo), I felt compelled to ask for clarification regarding the causal aspect of his architecture. It was a bit surprising to me that he mentioned that RL was not really needed, going in the opposite direction of what I would expect (i.e., that RL itself is insufficient). (There is also the literature on causal discovery, which in its most basic form attempts to learn a causal model from observational data. One of the conclusions is that this is almost never possible, and we usually end up with an equivalence class of models.)
In a bit more technical terms, it's understood that pure observational data, devoid of causal bias, is insufficient for making statements about interventions or counterfactuals, as we have demonstrated, for example, in Thm. 1 in https://t.co/MnlAuEgtoh. Given this impossibility result, we illustrate how integrating proper causal inductive bias with neural networks enables inferences using 'neural causal models,' as first shown in https://t.co/pSYJwXb50h. Furthermore, we can also perform counterfactual inferences within the realm of images through causal abstractions and representations (e.g., https://t.co/snwf6ElKDx or https://t.co/7kLkE0zYNS). In essence, my post does not make a negative claim but rather offers a nuanced scientific perspective on the interrelation between causal and neural modes of reasoning, as well as the significance of abstractions and representations. I hope this clarifies the discussion.
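To make the point about observational data concrete, here is a small textbook-style illustration (not the construction used in the theorem linked above). The two toy models, their names, and the Bernoulli(1/2) noise are assumptions chosen for simplicity. They agree on every observational probability yet disagree on the effect of an intervention.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def model_a(u, do_x=None):
    """X causes Y: X := U, Y := X."""
    x = u if do_x is None else do_x
    y = x
    return x, y

def model_b(u, do_x=None):
    """Common cause only: X := U, Y := U (X has no effect on Y)."""
    x = u if do_x is None else do_x
    y = u
    return x, y

def distribution(model, do_x=None):
    """Exact joint P(X, Y) by enumerating the exogenous U ~ Bernoulli(1/2)."""
    dist = {}
    for u in (0, 1):
        outcome = model(u, do_x)
        dist[outcome] = dist.get(outcome, 0) + HALF
    return dist

obs_a, obs_b = distribution(model_a), distribution(model_b)
int_a, int_b = distribution(model_a, do_x=1), distribution(model_b, do_x=1)

print("Same observational distribution:", obs_a == obs_b)
print("P(X, Y | do(X=1)) in model A:", int_a)
print("P(X, Y | do(X=1)) in model B:", int_b)
```

No amount of data sampled from the shared observational distribution can tell a learner which of the two interventional answers is right; that choice has to come from causal assumptions.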
Having said that, I am curious to understand in what ways both comments are valuable, given that Yann’s perspective on causality and its contrast with the existing literature was not clear to me; curious to learn from your insights.
Our 2012 paper ‘On causal and anticausal learning’ just received a Test of Time Honorable Mention at @icmlconf #ICML2022: https://t.co/gc1FZYSOyP. I am really grateful, and would like to use this occasion for some thoughts on causality and machine learning: