Abhinav Moudgil @amoudgl - Twitter Profile

Abhinav Moudgil

@amoudgl

about 2 months ago

@Aflah02101 @iclr_conf This was an open door helicopter ride, we really enjoyed it: https://t.co/D7AmqgoWvk

Abhinav Moudgil

@amoudgl

about 2 months ago

Big kudos to @iclr_conf for organizing #ICLR2026 so well, one of my best conference experiences. Rio is incredible!

Abhinav Moudgil

@amoudgl

about 2 months ago

If you’re curious, check out the paper here: https://t.co/W2J3VMiTrO

Abhinav Moudgil

@amoudgl

about 2 months ago

So cool to see our Attention to Mamba work among these fantastic papers!

DAIR.AI

@dair_ai

about 2 months ago

The Top AI Papers of the Week (April 19 - 26) - Skill-RAG - DeepSeek V4 - Autogenesis - Attention to Mamba - Stateless Decision Memory - Self-Evolving Logic Synthesis - Self-Generated World Knowledge Read on for more:

amoudgl retweeted

Paul Janson @janson002

about 2 months ago

PyLO is accepted to MLSys 2026! 🎉🚀 A PyTorch-native library bringing SOTA learned optimizers to the codebases most of us actually use — with fast CUDA kernels and real speedups on large-scale training. Drop-in ready, no more JAX-only barriers. Library: https://t.co/NTjBF64jD3

Abhinav Moudgil

@amoudgl

about 2 months ago

Work done with @BorisAKnyazev @g_lajoie_ @ebelilov at @Mila_Quebec

Abhinav Moudgil

@amoudgl

about 2 months ago

Heading to Rio 🇧🇷 to present our Celo line of work at #ICLR2026! Get in touch if you are curious about new avenues in neural network training or how we scaled learned optimizers from CIFAR-10 to GPT-3 🚀 Details ⬇️

Abhinav Moudgil

@amoudgl

about 2 months ago

2. Celo2: Towards Learned Optimization Free Lunch 🗓️ Sat, Apr 25, 2026 • 10:30 AM – 1:00 PM 📍 Pavilion 3 P3-423

amoudgl retweeted

DAIR.AI

@dair_ai

2 months ago

NEW paper from Apple.

Interesting idea: "Attention to Mamba".

The paper introduces a two-stage recipe for cross-architecture distillation from Transformers into Mamba.

Naive distillation collapses teacher performance. Their trick: first distill the transformer into a linearized-attention student using a kernel adaptation, then transfer that student into a pure Mamba with no attention blocks.

On a 1B model trained on 10B tokens, the Mamba student hits 14.11 perplexity against a 13.86 Pythia-1B teacher, nearly matching quality at linear-time inference cost.

If you can reliably convert trained transformers into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that decides which architectures actually get deployed in agent stacks.

Paper: https://t.co/h7k7OrG8Qj

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

Abhinav Moudgil

@amoudgl

2 months ago

Yes, exactly! We started this project with the goal of understanding representation capacity of Mamba modules and wanted to see how close we can get to Attention-based teacher model with pure Mamba architecture. Our two-stage approach is simpler + more principled and we do initialization with a lot of care so the underlying mechanics of sequence processing / performance of teacher model are retained as much as possible in every stage.

Abhinav Moudgil

@amoudgl

2 months ago

Check out the fresh arXiv preprint of work from my Apple internship!

Turing Post

@TheTuringPost

2 months ago

Attention → Mamba cross-architecture distillation is real

Transformer doesn't need to stay just a transformer – Apple showed how you can transfer it into a State Space Model (SSM)

▪️ It happens through a linearized-attention intermediate:

1. Distill the Transformer into a linearized attention model using a kernel trick:
→ Approximate the softmax exponential similarity in attention as a dot product of transformed features. This turns quadratic attention into linear attention.

2. Distill that model into Mamba SSM with proper initialization.

This method helps in 2 aspects:
- It avoids hybrid architectures
- And allows Mamba to reach perplexity 14.11 vs 13.86 for the original Transformer.

So the lesson here is not to start Mamba from random weights, but start it from a sequence mixer already aligned with the teacher Transformer.
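
The kernel trick in step 1 can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the feature map `elu(x) + 1` is a common choice in the linear-attention literature and is assumed here, and all function names are hypothetical. The sketch computes the same causal kernel attention twice, once pairwise (quadratic in sequence length) and once with a running state (linear in sequence length), which is the recurrent form that resembles an SSM-style sequence mixer.

```python
import math


def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive.
    # Assumed for illustration; the kernel used in the paper may differ.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V):
    """Causal kernel attention computed pairwise: O(n^2) in sequence length."""
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        fq = feature_map(q)
        # Similarity of query i with every key up to and including position i.
        weights = [dot(fq, feature_map(K[j])) for j in range(i + 1)]
        total = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / total
                    for c in range(d_v)])
    return out


def linear_attention(Q, K, V):
    """Same output via a running state: O(n) in sequence length."""
    d_k, d_v = len(Q[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = feature_map(k)
        for a in range(d_k):
            z[a] += fk[a]
            for c in range(d_v):
                S[a][c] += fk[a] * v[c]
        fq = feature_map(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d_k)) / norm
                    for c in range(d_v)])
    return out
```

Both functions return the same values up to floating-point error, because the dot product of transformed features lets the sum over past keys be factored out of the query term and carried forward as a fixed-size state.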

Abhinav Moudgil

@amoudgl

2 months ago

So glad to see this release! Just look at that huge leap in CharXiv figure understanding 🤯

Our tiny team in London worked insanely hard to contribute in this area and even the initial results blew our minds completely. No surprise it's SOTA in this category :) https://t.co/K4jTVsEDa1

Alexandr Wang

@alexandr_wang

2 months ago

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵 https://t.co/fThDXdsxwB

Abhinav Moudgil

@amoudgl

3 months ago

@Antrunt @BorisAKnyazev @ebelilov Thanks! It's the same as in standard optimizers: performance degrades smoothly as you deviate more from optimal LR. 1e-4 or 1/20x tuned AdamW LR are good starting points for Celo2.

More on LR sensitivity in appendix: https://t.co/h9IgaFa4Z4

Abhinav Moudgil

@amoudgl

3 months ago

Introducing Celo2: Towards Learned Optimization Free Lunch

We show that learned optimizers can generalize to practical tasks like GPT-3 1.3B pretraining and several out-of-distribution vision/RL tasks from limited meta-training (~4.5 GPU hours)!

🧵 https://t.co/NuvB4qIzX7

Abhinav Moudgil

@amoudgl

3 months ago

Finally, I'd like to acknowledge the Google TPU research cloud program that made this research possible and sincerely thank @mrtnm @kvfrans @_chris_lu_ for their open-source contributions by releasing clean jax codebases with TPU/FSDP support.

Abhinav Moudgil

@amoudgl

3 months ago

Work done with @BorisAKnyazev and @ebelilov. If you are interested in this line of research or related topics, our lab is hiring: https://t.co/Z9FUIkY5dR

Eugene Belilovsky

@ebelilov

4 months ago

I have open positions including Postdocs, PhD, master's students, and PhD interns. For more information https://t.co/8KUEDEfvz0
