Albert Gu @_albertgu - Twitter Profile

Pinned Tweet

3 months ago

The newest model in the Mamba series is finally here 🐍

Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models.

We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes.

This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

39

2K

311

842

446K

_albertgu retweeted

Eli @elipughresearch

6 days ago

🧵 on some fun insider details on ink-2 😼

2

45

9

7

5K

Albert Gu

@_albertgu

6 days ago

Our new model Ink-2 tops AA's leaderboard for streaming speech-to-text! Ink-2 comes with plenty of features optimized for real-time voice agents. With top-class models for both TTS and STT, the team at @cartesia keeps pushing the frontier of models for interactive intelligence.

Cartesia

@cartesia

6 days ago

Cartesia Ink-2 debuts as #1 for accuracy on the brand-new streaming speech-to-text leaderboard from @ArtificialAnlys! We designed Ink-2 from the ground up for voice agents - with low latency, eager transcripts, and semantic endpointing.

6

120

36

49

59K

6

102

20

13K

_albertgu retweeted

Ronak Malde

@rronak_

7 days ago

Today, @MichaelElabd, @QuantumArjun, and I are excited to announce Trajectory. We are a research lab and product company building the platform for Continual Learning.

Our platform unlocks the signal already sitting in product usage, so companies can continuously post-train large-scale agentic models that outperform the frontier. @trajectorylabs

We’ve raised $15M from @Conviction, @BessemerVP, @radicalvcfund, @jeffdean, @drfeifei and more.

We’re partnering with some of the best AI-native companies: @ClayRunHQ, @Harvey, @DecagonAI, @mercor_ai, @RogoAI to power their agentic systems, some of which we are already in production with.

We’ve brought together a world class research team from DeepMind, OpenAI, Apple, Meta Superintelligence, Amazon AGI, Scale AI, and an elite product team from Stripe and Figma.

AI will never again start on day one. Every correction, every retry, every edit will make products smarter. This is Continual Learning.

244

1K

145

778

2M

Albert Gu

@_albertgu

12 days ago

@NousResearch I don’t understand the second plot. Why are the first 50k steps different from the first plot? What is the loss function there

1

0

130

Albert Gu

@_albertgu

12 days ago

Extremely proud of the team @cartesia for launching Sonic 3.5, which sets a new state of the art for TTS.

I personally led the technical direction of this model; we built it ground up from first principles, and it contains multiple non-trivial ideas that differ substantially from anything we’ve seen in the literature. It’s been very gratifying to see research bets play out and the strong research team at Cartesia continue to grow!

Artificial Analysis

@ArtificialAnlys

12 days ago

Cartesia’s Sonic-3.5 takes the #1 spot on the Artificial Analysis Speech Arena Leaderboard, surpassing Inworld Realtime TTS 1.5 Max and Google’s Gemini 3.1 Flash TTS.

Sonic-3.5 is the latest TTS model from @cartesia. It supports 42 languages, including 9 Indian languages, with 500+ voices available out of the box. The model has been highly preferred among voters in the TTS Arena, with its demonstrated naturalness and accurate transcript following.

Key takeaways:
➤ Quality: Sonic-3.5 has an Elo score of 1,218 (+16/-16) based on 1,144 arena appearances, placing it ahead of Inworld Realtime TTS 1.5 Max at 1,194 and Gemini 3.1 Flash TTS at 1,209

➤ Pricing: Sonic-3.5 is priced at $39/1M characters, a premium compared to Gemini 3.1 Flash TTS at $18.3/1M characters, and Inworld Realtime TTS 1.5 Max at $35/1M characters

➤ Speed: 105.5 characters per second, compared to 205 characters per second for Inworld Realtime TTS 1.5 Max and 26.3 characters per second for Gemini 3.1 Flash TTS

See more details and listen to samples below 🧵
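
For a rough sense of what the Elo gaps in this tweet mean, here is a minimal sketch. It assumes the arena uses the conventional logistic Elo model with a 400-point scale; the tweet does not state the rating model, so both the formula and the helper name `expected_win_rate` are assumptions. Only the three ratings come from the tweet.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo logistic model.

    Assumption: 400-point scale, base 10. The leaderboard's actual model may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings as quoted in the tweet.
ratings = {
    "Sonic-3.5": 1218,
    "Gemini 3.1 Flash TTS": 1209,
    "Inworld Realtime TTS 1.5 Max": 1194,
}

sonic = ratings["Sonic-3.5"]
for name, rating in ratings.items():
    if name != "Sonic-3.5":
        # Implied head-to-head preference rate for Sonic-3.5 under the assumed model.
        print(f"Sonic-3.5 vs {name}: {expected_win_rate(sonic, rating):.3f}")
```

Under that assumption the gaps correspond to preference rates only slightly above one half. The 9-point lead over Gemini 3.1 Flash TTS is also smaller than the quoted +16/-16 interval on Sonic-3.5's own score.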

18

261

54

89

108K

7

184

18

31

20K

_albertgu retweeted

Tri Dao

@tri_dao

13 days ago

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

18

1K

128

945

130K

_albertgu retweeted

Arshia Afzal

@rshia_afz

17 days ago

Raven is now also available at fla as well! Enjoy playing with it🐦‍⬛. Special thanks to amazing fla team 🎉! https://t.co/HofNAOYUkt

1

30

9

3

4K

_albertgu retweeted

Alisa Liu @alisawuffles

20 days ago

In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.
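
The relationship described here can be written as tokens/param = (bytes/param) / (bytes/token). The sketch below only illustrates that arithmetic. The bytes-per-param value and the two compression rates are hypothetical numbers chosen for illustration, not results from the paper, and `optimal_tokens_per_param` is a made-up helper name.

```python
def optimal_tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a fixed train-bytes-per-parameter ratio into tokens per parameter.

    bytes_per_token is the tokenizer's compression rate (higher = more compression).
    """
    return bytes_per_param / bytes_per_token


# Hypothetical values, for illustration only.
BYTES_PER_PARAM = 80.0
for bytes_per_token in (4.0, 6.0):
    ratio = optimal_tokens_per_param(BYTES_PER_PARAM, bytes_per_token)
    print(f"{bytes_per_token} bytes/token -> {ratio:.1f} tokens/param")
```

If the bytes-per-param ratio is held fixed, a tokenizer that packs more bytes into each token needs fewer train tokens per parameter, which matches the direction stated in the tweet.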

3

195

23

118

25K

_albertgu retweeted

Thomas G. Dietterich @tdietterich

20 days ago

Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/

140

6K

917

1K

1M

_albertgu retweeted

Yifan Zhang

@yifan_zhang_

22 days ago

Higher-Order Linear Attention Models Are RNNs/SSMs: Generalizing State-Space Duality to higher-order linear attention. It’s getting wild. https://t.co/vUBN3nDFMy

8

712

108

633

41K

Albert Gu

@_albertgu

27 days ago

Introducing a new sequence model Raven which pushes the boundary of fixed-state-size sequence models!

Raven bridges popular linear-time models with constant state capacity, like SSMs and sliding window attention (SWA). Like SWA, its state is a finite set of slots; unlike SWA, Raven learns to selectively choose which slots to update with each new token it caches. This is a much more principled update mechanism that leads to dramatically better retrieval abilities than prior linear models.

I personally don't think SWA is a very principled model - but it's convenient and works well empirically - and am most excited to see Raven be used as a strictly better drop-in replacement. More broadly the framework it develops hopefully introduces more ideas to combine the strengths of SSM-like and attention-like models.

This work was led by @rshia_afz and @avivbick.

Arshia Afzal

@rshia_afz

27 days ago

1/ SSMs struggle on recall benchmarks due to their fixed-size state. But are current models actually storing context “wisely”?

Introducing Raven 🐦‍⬛, the first SSM with selective memory allocation! Raven achieves SOTA performance on recall-heavy tasks with the highest length generalization, extending up to 16× beyond its training sequence length. Raven is a strict upgrade over SWA in the way it stores past context!

This is the most elegant model I’ve been involved in designing so far. Shoutout to @avivbick and @_albertgu for their trust and amazing work!

Check out how Raven bridges between SWA and SSM👇

5

267

29

195

276K

4

305

34

200

41K

_albertgu retweeted

Aviv Bick

@avivbick

27 days ago

SSMs fail on recall tasks they have the capacity to solve. The two dominant approaches today, SSMs and sliding-window attention, both lack persistence: memory either decays over time or gets evicted. We built Raven to fix this, surpassing all prior linear models even at 16× their training sequence length. 🧵🐦‍⬛

5

391

58

310

52K

_albertgu retweeted

Will Bui

@will_ea

about 1 month ago

27x faster Attention Residuals!!! 🚀

We implemented Block AttnRes as a pip-installable package.

!pip install flash-attn-res

No annoying kernel nonsense.
No compile/autograd plumbing.
Call it like a regular PyTorch op.

It just works.

Methodology:
🔹 fused triton kernels
🔹 batched attention over residual blocks
🔹 online-softmax merge
🔹 flash attention-style split-KV reduction

Thanks @LLMenjoyer and @cartesia for the support and guidance✌️
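
As background on the "online-softmax merge" item, here is a minimal pure-Python sketch of the general technique: softmax-weighted sums computed on separate chunks are combined exactly by rescaling each chunk to a shared maximum. This is not the package's code or API. The function names are hypothetical, and scalar values stand in for the vectors a real kernel would use.

```python
import math


def partial_state(scores, values):
    """Per-chunk statistics: running max, sum of exponentials, unnormalized output."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def merge(state_a, state_b):
    """Combine two chunk states by rescaling both to the shared maximum."""
    m_a, l_a, o_a = state_a
    m_b, l_b, o_b = state_b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * scale_a + l_b * scale_b, o_a * scale_a + o_b * scale_b


def finalize(state):
    """Normalize the accumulated output by the accumulated softmax denominator."""
    _, l, o = state
    return o / l


def reference(scores, values):
    """Single-pass softmax-weighted sum, used only to check the merged result."""
    return finalize(partial_state(scores, values))


scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
merged = merge(partial_state(scores[:2], values[:2]),
               partial_state(scores[2:], values[2:]))
print(abs(finalize(merged) - reference(scores, values)) < 1e-12)
```

Because the merge is associative, chunks can be processed in any order or in parallel and reduced afterwards, which is the property split-KV reductions rely on.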

23

763

83

567

75K

Albert Gu

@_albertgu

about 1 month ago

congrats to the team, especially the amazing undergrads who led the project!

Arnav Shah @arnavshah0

about 1 month ago

Excited to announce that dnaHNet has been accepted as an ICML 2026 Spotlight paper! Very grateful to my coauthors @victor_ljz and team, plus our remarkable supervisors @_albertgu and @genophoria.

6

63

6

20

23K

0

60

2

15

10K

Albert Gu

@_albertgu

about 1 month ago

@roydanroy 🤫

1

67

1

6

5K

_albertgu retweeted

Sham Kakade

@ShamKakade6

about 1 month ago

1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.

17

551

70

431

253K

_albertgu retweeted

Sukjun (June) Hwang

@sukjun_hwang

about 1 month ago

I am in Rio for #ICLR2026 🇧🇷 @fluorane @_albertgu and I will be presenting H-Net at [Pavilion3 P3-#1015] 3:15-5:45 today (Thursday). Stop by our poster to see why we’re so excited about the future of H-Net! I will also be happy to talk to new people over the week. Let me know if you'd like to grab a coffee, DMs open.

1

45

5

5K

_albertgu retweeted

Kimi.ai @Kimi_Moonshot

about 1 month ago

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: https://t.co/sf4UohXDWY

45

2K

184

616

213K

Albert Gu

@_albertgu

about 2 months ago

a dynamical systems point of view, which looks like an SSM applied along the residual stream, informs more principled ways to scale looped architectures

Hayden Prairie @hayden_prairie

about 2 months ago

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters.

Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size.

Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem!

🧵👇

41

1K

179

1K

294K

0

218

30

104

26K
