Maximilian Beck @maxmbeck - Twitter Profile

Pinned Tweet

about 1 year ago

Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA), ⚡️ A new kernel algorithm for the mLSTM and other Linear Attention variants with Gating. We find TFLA is really fast! 🧵(1/11)

https://t.co/SdGk9OAyhH

3

344

60

208

48K

maxmbeck retweeted

Tilde

@tilderesearch

2 days ago

https://t.co/rmTk8GMkir

7

359

41

355

85K

maxmbeck retweeted

Lukas Aichberger @aichberger

3 days ago

We unlocked the working memory of LLMs 💥 Reasoning in Memory (RiM) replaces autoregressive "thinking out loud" with fixed memory blocks that form a task-specific workspace for latent reasoning. The key idea is simple: reasoning should happen inside the LLM, not in its output!

24

314

52

255

57K

Maximilian Beck @maxmbeck

21 days ago

Life update: A few weeks ago, I moved to Paris 🇫🇷 to start a new position as AI Scientist at Meta FAIR. I am excited about this new chapter and look forward to the opportunities ahead.✨

7

48

0

2K

maxmbeck retweeted

Ai2 @allen_ai

about 1 month ago

Recipes for teaching language models to handle long inputs don't work equally well across model families. We wanted to know why—is it the architecture, the training data, or both? 🧵

https://t.co/2WyPBZKbEO

5

83

15

60

25K

maxmbeck retweeted

Günter Klambauer @gklambauer

about 1 month ago

GREAT news!!! 4 papers from our group got accepted at ICML 2026!!!
- 🧬 Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
- 🔁 xLSTM Distillation: Achieving Teacher-Student Parity Through Efficient Hybrid Architectures

1

19

4

5

3K

maxmbeck retweeted

Sepp Hochreiter @HochreiterSepp

about 1 month ago

RNNs like xLSTM with a vertically chunked inference strategy for efficient memory use: https://t.co/YX6UPapx6Q Chunking enables linear-time, constant-memory processing, like TFLA for xLSTM: https://t.co/1oZu9p3ydO Blocks are chunked via recurrent updates, which speeds up computation considerably.

1

90

14

74

9K

Maximilian Beck @maxmbeck

about 2 months ago

If you want to know more, visit our poster at ICLR: https://t.co/kjYc7beHe4

0

2

0

1

299

Maximilian Beck @maxmbeck

about 2 months ago

We’ve released 35 xLSTM checkpoints from our scaling law study, spanning 160M to 7B parameters and trained on 3B to 1.5T tokens from the DCLM dataset. https://t.co/RR9YC1KvKW

Maximilian Beck @maxmbeck

8 months ago

🚀 Excited to share our new paper on scaling laws for xLSTMs vs. Transformers.

Key result: xLSTM models Pareto-dominate Transformers in cross-entropy loss.
- At fixed FLOP budgets → xLSTMs perform better
- At fixed validation loss → xLSTMs need fewer FLOPs

🧵 Details in thread https://t.co/Rd3s08BtQY

14

230

40

109

85K

2

116

13

56

12K

Maximilian Beck @maxmbeck

about 2 months ago

These checkpoints come from our token-per-parameter training setup and are fully compatible with the xLSTM-7B Hugging Face implementation: https://t.co/hUj3TqJbp8

1

0

334

Maximilian Beck @maxmbeck

2 months ago

@KorbiPoeppel @HochreiterSepp Thanks, Korbi 🙂

0

64

Maximilian Beck @maxmbeck

2 months ago

Now, I’m looking forward to a relaxing Easter break and I’m excited for what comes next 🚀 📄 Thesis: https://t.co/Ai5xDZ44eO 🎤 Defense slides: https://t.co/k0vtzLnsLX

0

11

1

2

973

Maximilian Beck @maxmbeck

2 months ago

👨‍🎓Last week, I successfully defended my PhD thesis - an incredibly exciting and rewarding milestone after 3.5 years of work on xLSTM: Recurrent Neural Network Architectures for Scalable and Efficient Large Language Models

https://t.co/IK4HzNCNFp

16

137

3

5

9K

Maximilian Beck @maxmbeck

2 months ago

And of course many thanks to @KorbiPoeppel for being an amazing co-author on nearly all xLSTM papers. I also want to thank all collaborators, friends, and family for their support.🤗

1

3

0

341

Maximilian Beck @maxmbeck

3 months ago

Looks great! 🔥 Thanks for adding, @rasbt

Sebastian Raschka

@rasbt

3 months ago

@maxmbeck Added ✅ https://t.co/NX2sM7aUtc Thanks again!

3

80

9

27

4K

0

7

2

545

maxmbeck retweeted

Niklas Schmidinger

@smdrnks

3 months ago

Excited to share our new paper: Effective Distillation to Hybrid xLSTM Architectures. TL;DR: we retrofit / graft / distill / linearize Transformers into xLSTM-SWA hybrids with fixed-size states. This gives a practical path to studying linear and hybrid architectures starting from already strong pretrained models.

1

15

6

3

1K

Maximilian Beck @maxmbeck

3 months ago

🆕 New xLSTM models! 🔥 ⚗️ This time distilled from Llama, Qwen & Olmo!

Sepp Hochreiter @HochreiterSepp

3 months ago

xLSTM Distillation: https://t.co/iBIJzGbzXX Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.