27x faster Attention Residuals!!!
We implemented Block AttnRes as a pip-installable package.
!pip install flash-attn-res
No annoying kernel nonsense.
No compile/autograd plumbing.
Call it like a regular PyTorch op.
It just works.
Methodology:
🔹 fused Triton kernels
🔹 batched attention over residual blocks
🔹 online-softmax merge
🔹 Flash Attention-style split-KV reduction
Thanks @LLMenjoyer and @cartesia for the support and guidance!
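To make the "online-softmax merge" and "split-KV reduction" items concrete, here is a minimal pure-Python sketch of the general technique. It is not the package's code: the function names `partial_attention`, `merge`, and `finalize` are hypothetical, and the sketch handles a single query with no 1/sqrt(d) scaling. Each KV chunk returns an unnormalized output together with its running max and sum of exponentials, and two such partial results can be combined exactly.

```python
import math

def partial_attention(q, keys, values):
    """Attend one query over one KV chunk.

    Returns (unnormalized output, max score, sum of exp(score - max)).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return acc, m, sum(weights)

def merge(a, b):
    """Combine two partial results by rescaling both to a shared max."""
    acc_a, m_a, l_a = a
    acc_b, m_b, l_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    acc = [scale_a * x + scale_b * y for x, y in zip(acc_a, acc_b)]
    return acc, m, scale_a * l_a + scale_b * l_b

def finalize(state):
    """Divide by the accumulated softmax denominator."""
    acc, _, l = state
    return [x / l for x in acc]

# Illustrative data only.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = finalize(partial_attention(q, keys, values))
split = finalize(merge(partial_attention(q, keys[:2], values[:2]),
                       partial_attention(q, keys[2:], values[2:])))
```

Because the merge is exact, `split` matches `full` up to floating-point rounding, which is what lets the KV dimension be processed in independent chunks and reduced afterwards.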
Introducing Attention Residuals: Rethinking depth-wise aggregation.
Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
Full report:
https://t.co/u3EHICG05h
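As a rough illustration of the idea described above, the following pure-Python sketch contrasts uniform residual accumulation with a softmax-weighted mix over preceding layer outputs. It is one possible reading of the announcement, not the formulation from the report: the per-layer `query` vector, the dot-product scoring, and the sum-within-block compression in `block_summaries` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_residual(layer_outputs):
    """Fixed, uniform accumulation: every earlier output gets weight 1."""
    dim = len(layer_outputs[0])
    return [sum(v[d] for v in layer_outputs) for d in range(dim)]

def attention_residual(layer_outputs, query):
    """Assumed form: score each earlier output against a learned query,
    then mix the outputs with the resulting softmax weights."""
    weights = softmax([dot(query, v) for v in layer_outputs])
    dim = len(layer_outputs[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, layer_outputs))
             for d in range(dim)]
    return mixed, weights

def block_summaries(layer_outputs, block_size):
    """Assumed compression: sum the outputs inside each block so that
    attention runs over a few block summaries instead of every layer."""
    return [standard_residual(layer_outputs[i:i + block_size])
            for i in range(0, len(layer_outputs), block_size)]

# Illustrative data: four earlier layer outputs of width 2.
outputs = [[1, 0], [0, 1], [2, 2], [1, 1]]
mixed, weights = attention_residual(outputs, query=[1.0, 0.0])
blocks = block_summaries(outputs, block_size=2)
block_mixed, block_weights = attention_residual(blocks, query=[1.0, 0.0])
```

In this sketch the weights depend on the representations themselves, so a layer can emphasize some earlier outputs and nearly ignore others, while the block variant shrinks the number of attended items from the layer count to the block count.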
if you aren't sleeping at least 9h+ per day you are going to get obliterated by better sleepers
like the quality of their thought will just leapfrog you out of existence
you'll be out there yawning away in meetings with your brain clogged up with amyloid-beta
them? sharp as an arrow with not a tau in sight
We just launched a new project that teaches you how to build Flash Attention with CUDA, step by step.
By the end, you'll have a working Flash Attention kernel built from the ground up.
The project covers:
-CUDA primitives warm-up
-Matrix operations
-Naive attention baseline
-Online softmax math
-Tiled attention building blocks
-Fused Flash Attention kernel
-Causal Flash Attention
It will be open to everyone for the first 2 weeks, then it will become part of our premium projects.
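For readers who want a feel for the steps in that list before touching CUDA, here is a small pure-Python sketch of a naive causal attention baseline next to a tiled version that keeps a running max and running sum per query row. It is not the project's material: the function names and the `tile` parameter are hypothetical, and the 1/sqrt(d) scaling is left out for brevity.

```python
import math

def naive_causal_attention(Q, K, V):
    """Baseline: build each full score row, softmax it, then weight V."""
    dim = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Causal mask: query i only sees keys 0..i.
        scores = [sum(a * b for a, b in zip(q, K[j])) for j in range(i + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * V[j][d] for j in range(i + 1)) / total
                    for d in range(dim)])
    return out

def tiled_causal_attention(Q, K, V, tile=2):
    """Same result, but keys are visited tile by tile with online softmax."""
    dim = len(V[0])
    out = []
    for i, q in enumerate(Q):
        m, l, acc = -math.inf, 0.0, [0.0] * dim
        for start in range(0, i + 1, tile):
            end = min(start + tile, i + 1)  # never read keys after position i
            scores = [sum(a * b for a, b in zip(q, K[j]))
                      for j in range(start, end)]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescales earlier tiles; 0.0 on the first
            p = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(p)
            acc = [a * scale + sum(pj * V[start + j][d] for j, pj in enumerate(p))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out

# Illustrative data: 5 positions, width 2.
Q = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [2.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, -0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_causal_attention(Q, K, V)
tiled = tiled_causal_attention(Q, K, V, tile=2)
```

The tiled version never stores a whole score row, which is the property a fused GPU kernel relies on to keep intermediate values in fast on-chip memory.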
@MainzOnX glad to hear! it's interesting to hear people talk about using agents and hardly writing code since my experience is that i still have to be heavily involved in editing the code (though i also have really high code quality standards as an oss contributor)
@rayandabbagh @Meta It's not like she joined and they realized she's an underperformer either, because they wouldn't have extended her a return offer otherwise!