Will Bui @will_ea - Twitter Profile

Pinned Tweet

about 1 month ago

27x faster Attention Residuals!!! 🚀

We implemented Block AttnRes as a pip-installable package.

!pip install flash-attn-res

No annoying kernel nonsense.
No compile/autograd plumbing.
Call it like a regular PyTorch op.

It just works.

Methodology:
🔹 fused triton kernels
🔹 batched attention over residual blocks
🔹 online-softmax merge
🔹 flash attention-style split-KV reduction

Thanks @LLMenjoyer and @cartesia for the support and guidance✌️
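
The "online-softmax merge" and "split-KV reduction" items above can be illustrated with a small sketch. This is not the flash-attn-res implementation (which the tweet describes as fused Triton kernels). It is a plain-Python toy showing why attention computed over separate key/value chunks can be combined exactly. The function names `partial_attention`, `merge`, and `finalize` are hypothetical and chosen only for this example.

```python
import math

def partial_attention(scores, values):
    """Attention over one KV chunk.

    Returns (max score, sum of exponentials, unnormalized output) so the
    chunk can later be merged with other chunks.
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o

def merge(a, b):
    """Online-softmax merge: rescale both partials to a shared max, then add."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    l = l_a * scale_a + l_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, l, o

def finalize(state):
    """Divide by the accumulated exp-sum to get the softmax-weighted output."""
    _, l, o = state
    return [x / l for x in o]

# Toy data: four attention scores and four 2-d value vectors (made up).
scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]

# Split the keys/values into two chunks, process each alone, then merge.
split = finalize(merge(partial_attention(scores[:2], values[:2]),
                       partial_attention(scores[2:], values[2:])))

# Reference: the same attention computed in a single pass.
reference = finalize(partial_attention(scores, values))
```

Here `split` and `reference` agree up to floating-point error, which is the property that lets a kernel reduce over KV chunks in parallel and combine the pieces afterwards.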

Kimi.ai @Kimi_Moonshot

3 months ago

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗Full report:
https://t.co/u3EHICG05h
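
As a rough illustration of "learned, input-dependent attention over preceding layers" versus "fixed, uniform accumulation", here is a toy sketch. It is not the formulation from the report: the scaled dot-product scoring, the single `query` vector, and the function names are all assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def uniform_residual(layer_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(h[d] for h in layer_outputs) for d in range(dim)]

def attention_residual(layer_outputs, query):
    """Toy depth-wise attention: a softmax-weighted mix of earlier outputs.

    `query` stands in for a learned per-layer vector (an assumption here);
    the weights depend on the layer outputs themselves, so they change
    with the input.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, h) / scale for h in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, layer_outputs))
             for d in range(dim)]
    return mixed, weights

# Made-up outputs of three earlier layers for one token, hidden size 2.
layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

summed = uniform_residual(layer_outputs)
mixed, weights = attention_residual(layer_outputs, query)
```

In this toy, `summed` grows with the number of layers, while `mixed` stays a convex combination of the earlier outputs, which loosely mirrors the "dilution and hidden-state growth" point in the tweet.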

will_ea retweeted

llm_enjoyer

@LLMenjoyer

2 days ago

video of my training runs

will_ea retweeted

Yacine Mahdid

@yacinelearning

2 days ago

if you aren’t sleeping at least 9h+ per day you are going to get obliterated by better sleepers like the quality of their thought will just leapfrog you out of existence you’ll be out there yawning away in meetings with your brain clogged up with amyloid-beta them? sharp as an arrow with not a tau in sight

will_ea retweeted

llm_enjoyer

@LLMenjoyer

3 days ago

bro the NEURAL CIRCUITS bro!! the SPARSE AUTOENCODERS...!

Will Bui

@will_ea

5 days ago

@michellechen the computer history museum nearby is also very lit

Will Bui

@will_ea

5 days ago

@Onanaroghene @TheVixhal Learning something deeply requires actively doing something with the information you learned

will_ea retweeted

Deep-ML

@real_deep_ml

5 days ago

We just launched a new project that teaches you how to build Flash Attention with CUDA, step by step.

By the end, you’ll have a working Flash Attention kernel built from the ground up.

The project covers:
-CUDA primitives warm-up
-Matrix operations
-Naive attention baseline
-Online softmax math
-Tiled attention building blocks
-Fused Flash Attention kernel
-Causal Flash Attention

It will be open to everyone for the first 2 weeks, then it will become part of our premium projects.

will_ea retweeted

llm_enjoyer

@LLMenjoyer

5 days ago

u js trained on test bro, it's not that deep

Will Bui

@will_ea

5 days ago

@CoreAutoAI not deep enough learning

Will Bui

@will_ea

8 days ago

@dogacel0 @ChinmayKak new epoch

Will Bui

@will_ea

8 days ago

@Tim_Dettmers The better baseline would have been to use KVTC, which this blog built upon. https://t.co/NeWLQ3lBpn

Will Bui

@will_ea

9 days ago

@krishgarg @GoogleDeepMind This needs a KVTC baseline https://t.co/NeWLQ3lBpn

Will Bui

@will_ea

10 days ago

@gurishsharma @briar2682 so, no.

Will Bui

@will_ea

11 days ago

@MainzOnX glad to hear! it's interesting to hear people talk about using agents and hardly writing code since my experience is that i still have to be heavily involved in editing the code (though i also have really high code quality standards as an oss contributor)

Will Bui

@will_ea

12 days ago

@LLMenjoyer i better see attention residuals in sonic 4 arch

Will Bui

@will_ea

12 days ago

@chongz I have always believed in SSMs

Will Bui

@will_ea

12 days ago

@rayandabbagh @Meta It's not like she joined and they realized she's an underperformer either, because otherwise they wouldn't have extended her a return offer!

Will Bui

@will_ea

12 days ago

@josip_ @rayandabbagh @Meta okay how about this? same role same location same yoe

Will Bui

@will_ea

16 days ago

@LLMenjoyer @tri_dao gpt helped me crank out my kernels (w/ a lot of handholding from me ofc) https://t.co/s006fUVmRK
