Zhixuan Lin @zhxlin - Twitter Profile

Pinned Tweet

9 months ago

#COLM2025 We introduce Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX), a provably safe pruning method that significantly speeds up our Forgetting Attention kernel, especially for long-context pretraining. Our simple Triton kernel with ACP is 1.7x to 2.4x faster than the official FlashAttention2 kernel when pretraining 760M-param models with context lengths from 4k to 16k on 4xL40S!
• Code: https://t.co/AuqqoqjYJ3
• Paper: https://t.co/joohFxw9ey
Joint work with @johanobandoc, Xu Owen He, @AaronCourville, from @Mila_Quebec and @makermaker_ai
More details👇

zhxlin retweeted

Oliver Sieberling

@osieberling

6 days ago

New paper 🧵 We show that dynamic short convolutions consistently improve Transformers across scales. We make these gains practical with an efficient parameterization and custom Triton GPU kernels. The improvements carry over to MoEs and linear attention variants (Mamba-2/GDN).

zhxlin retweeted

Arshia Afzal

@rshia_afz

11 days ago

After Wall Attention, I now have an updated version of my blog post on positional embeddings derived from SSMs. Only KDA remains to complete the table, and it is Path+Wall PE! It may even be done by now. Check it out 👇 https://t.co/F6H9qpbZAE

zhxlin retweeted

QianYang 🚀 CVPR @QianYangMila

11 days ago

🧠 New paper: "How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning" Cross-view spatial reasoning is hard for VLMs. Language-only reasoning loses geometry.🧵👇

Zhixuan Lin @zhxlin

12 days ago

A bit surprising to see that FoX doesn't look very good in the comparisons though, especially in needle retrieval... Might be some strange interaction between the training setting and FoX 🤔

Zhixuan Lin @zhxlin

12 days ago

Interesting method that applies per-channel multiplicative decay to QK inner products, orthogonal to FoX's additive decay bias. The blog post is also very well written with informative details and ablations!

Tilde

@tilderesearch

12 days ago

https://t.co/rmTk8GMkir

zhxlin retweeted

Tilde

@tilderesearch

16 days ago

~1/7~Introducing Parallax → a stronger attention variant that achieves a Pareto improvement over vanilla attention at 0.6B and 1.7B scales. Parallax has better perplexity, better downstream accuracy, and a decode kernel that matches or beats FlashAttention. 🧵

zhxlin retweeted

Jindong Jiang @JindongJiang

23 days ago

🚀 One of the most exciting features of our Nemotron-Labs-Diffusion is Tri-Mode Support: AR mode for accuracy, diffusion for speed, and self-speculation with diffusion drafting + AR verification for AR-level accuracy at much higher speed. Check out our paper for more details!

zhxlin retweeted

Han Guo

@HanGuo97

24 days ago

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

zhxlin retweeted

Milad Aghajohari @MAghajohari

about 1 month ago

Excited to see that Markovian Thinker contributed to Zyphra's strong release 🚀. Their Markovian RSA: markovian thinking (carrying forward bounded-length reasoning tails) + RSA (recursive self-aggregation) boosted test-time compute to be on-par with larger reasoning models. 1/

zhxlin retweeted

Arshia Afzal

@rshia_afz

about 1 month ago

1/ SSMs struggle on recall benchmarks due to their fixed-size state. But are current models actually storing context “wisely”?
Introducing Raven 🐦‍⬛, the first SSM with selective memory allocation!
Raven achieves SOTA performance on recall-heavy tasks with the highest length generalization, extending up to 16× beyond its training sequence length.
Raven is a strict upgrade over SWA in the way it stores past context!
This is the most elegant model I’ve been involved in designing so far. Shoutout to @avivbick and @_albertgu for their trust and amazing work!
Check out how Raven bridges between SWA and SSM👇

zhxlin retweeted

Johan Obando-Ceron 👍🏽

@johanobandoc

about 1 month ago

🥳 Excited to share that our paper "Stable Deep Reinforcement Learning via Isotropic Gaussian Representations" has been accepted at #ICML2026 as a Spotlight (Top 2.2%)✨
📄 Paper: https://t.co/tUFlyMHeQx
💻 Code: https://t.co/xFN1ErauD3
🫶 Wonderful collaboration with @aspa1313, @AaronCourville, @PouyaBashivan and @pcastr.
✈️ See you all in Korea 🇰🇷

zhxlin retweeted

Tianwei Ni @twni2016

about 1 month ago

Excited to share that our paper "Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism" has been accepted at ICML 2026! Thanks to everyone who supported and contributed to this work. 📄 https://t.co/ck6Ju0dcQh

zhxlin retweeted

Continual RL Workshop @continual_learn

about 1 month ago

📢 Call for papers: Continual RL Workshop @ RLC 2026, Montreal
🗓️ Submission deadline: May 22, 2026 (AoE)
🔗 Website & CFP: https://t.co/JmNXyeyPSL
#ReinforcementLearning #ContinualLearning #MachineLearning #RLC2026 #ContinualRL

zhxlin retweeted

Guozheng Ma

@Guozheng_Ma

about 1 month ago

Our "What Makes Value Learning Efficient in Residual RL?" accepted to #ICML2026 as a ✨Spotlight✨! Value learning silently fails in residual RL. We pinpoint why, and propose 𝐃𝐀𝐖𝐍: a minimal fix that delivers ~5× faster convergence across benchmarks, policies, and modalities. 📄 Preprint: https://t.co/Bm76uflaZS

zhxlin retweeted

Yihao Sun @Tobealegend24

about 1 month ago

Our paper VLA-MBPO got into ICML! 🎉 Model-based RL has always been the “high potential, painful to tune” corner of RL. But our work pushes a classic MBRL algorithm (MBPO) to a new level: one shared set of hyperparameters sweeps every sim & real-robot env — no per-task tuning.

zhxlin retweeted

Ai2 @allen_ai

about 1 month ago

Recipes for teaching language models to handle long inputs don't work equally well across model families. We wanted to know why—is it the architecture, the training data, or both? 🧵

zhxlin retweeted

André Jonasson @afjonasson

10 months ago

What are those two large-magnitude bands in the activations of the queries and keys of LLMs with rotary positional embeddings? 🧵

zhxlin retweeted

Huiqiang Jiang @iofu728

about 2 months ago

🌩️ Introducing FlashQLA: high-performance linear attention kernels on TileLang.
⚡ 2-3× fwd, 2× bwd speedup.
💻 Purpose-built for agentic on your personal devices.
1. Gate-driven auto intra-card CP.
2. Hardware-friendly reformulation.
3. TileLang fused warp-specialized kernels.

zhxlin retweeted

David Duvenaud

@DavidDuvenaud

about 2 months ago

@geoffreyirving We tried that! The vintage models can just barely start to do simple things with Python, purely from in-context learning:

zhxlin retweeted

Johan Obando-Ceron 👍🏽

@johanobandoc

about 2 months ago

🔥 The AutoRL workshop is shaping up to be an exciting venue. If your work aligns, we strongly encourage you to submit. Great talks and an exciting panel will be announced soon. #RLC @RL_Conference
