We're open-sourcing FlashKDA, our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. It achieves a 1.72× to 2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention.
Explore on GitHub: https://t.co/sf4UohXDWY
flash-linear-attention is now seeing over 15,000 daily downloads.
We (@SonglinYang4 @uniartisan) are honored to see fla becoming a core piece of the infrastructure for efficient model archs. Grateful to the community for the trust and support.
https://t.co/VirlvFzgYc
Introducing Attention Residuals: Rethinking depth-wise aggregation.
Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
Full report:
https://t.co/u3EHICG05h
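To make the contrast with fixed, uniform accumulation concrete, here is a minimal pure-Python sketch for a single token. It is an illustration under assumptions, not the method from the report: the `query` vector, the use of raw layer outputs as both keys and values, and the function names are all hypothetical, and Block AttnRes is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def uniform_residual(layer_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(dim)]


def attention_residual(query, layer_outputs):
    """Softmax-weighted mix of earlier layer outputs for one token.

    Assumption: `query` stands in for a learned, input-dependent vector, and
    each earlier layer output serves as its own key and value.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, h) * scale for h in layer_outputs])
    dim = len(query)
    mixed = [sum(w * h[i] for w, h in zip(weights, layer_outputs)) for i in range(dim)]
    return mixed, weights


# Outputs of three earlier layers for one token (made-up numbers).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summed = uniform_residual(outputs)
mixed, weights = attention_residual([4.0, 0.0], outputs)
```

Because the softmax weights sum to 1, `mixed` is a convex combination of the earlier outputs, so in this sketch its magnitude does not grow with depth the way the plain sum in `uniform_residual` does.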
@im_datta0 @hu_yifei Please share a minimal script. FLA provides multiple ways to accelerate training; even Qwen3.5 itself uses FLA. In my opinion, the key is to avoid compilation and h2d/d2h transfers.
Context Arena Update: Added kimi-linear-48b-a3b-instruct [11-08] and kimi-k2 (Thinking) [11-06] to the MRCR leaderboards.
The Linear 48b results are fascinating! It actually outperforms the new Gemini 3.0 Pro Thinking on 4-needle and 8-needle tasks at higher context lengths (512k+). I've added it to 2needle, 4needle, and 8needle.
kimi-k2 (Thinking) lands lower on the leaderboards (Rank #22 for 2-needle AUC @ 128k), with a hard context ceiling around 262k. I did not run it for 2needle and 4needle.
All results at: https://t.co/gLEWzxoXWG
The performance curve for the Linear model is distinct: while it underperforms Gemini 3 significantly at shorter contexts (<=256k) on the difficult 8-needle test, its degradation slope is much flatter. Gemini starts higher and drops fast; Kimi starts lower but holds steady, overtaking Gemini at the higher end.
However, note that kimi-linear-48b has noticeable performance drops past 128k on the easier 2 & 4 needle tests. Additionally, due to lower token efficiency compared to Gemini/GPT, only ~60% of the 1M token tests successfully ran (hitting limits/OOM), so treat the results at the 1M level with some caution.
kimi-linear-48b results:
2-Needle Performance (@ 128k / @ 1M):
- AUC: 96.5% (vs Gem 3: 99.5%) / 81.7% (vs Gem 3: 85.5%)
- Pointwise: 96.0% (vs Gem 3: 99.0%) / 77.0% (vs Gem 3: 72.2%)
4-Needle Performance (@ 128k / @ 1M):
- AUC: 85.5% (vs 85.8%) / 62.7% (#1, beating Gem 3: 57.3%)
- Pointwise: 83.7% (vs 80.8%) / 51.5% (#1, beating Gem 3: 34.3%)
8-Needle Performance (@ 128k / @ 1M):
- AUC: 54.9% (vs 73.0%) / 43.8% (#1, beating Gem 3: 39.0%)
- Pointwise: 49.0% (vs 54.2%) / 35.3% (#1, beating Gem 3: 24.5%)
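The "flatter degradation slope" claim can be checked directly against the AUC figures listed above. This is a small sketch that just subtracts the 1M score from the 128k score; it assumes the unlabeled "vs" values in the 4-needle and 8-needle rows refer to Gemini 3, and the dictionary keys are illustrative labels rather than official model IDs.

```python
# AUC scores (%) copied from the list above: (at 128k, at 1M) per needle count.
# Assumption: unlabeled "vs" numbers are Gemini 3 results.
AUC = {
    "kimi-linear-48b": {2: (96.5, 81.7), 4: (85.5, 62.7), 8: (54.9, 43.8)},
    "gemini-3": {2: (99.5, 85.5), 4: (85.8, 57.3), 8: (73.0, 39.0)},
}


def auc_drop(model, needles):
    """Percentage points lost between the 128k and 1M AUC scores."""
    at_128k, at_1m = AUC[model][needles]
    return round(at_128k - at_1m, 1)


for needles in (2, 4, 8):
    kimi = auc_drop("kimi-linear-48b", needles)
    gemini = auc_drop("gemini-3", needles)
    print(f"{needles}-needle AUC drop, 128k -> 1M: Kimi {kimi} pts, Gemini 3 {gemini} pts")
```

On these two endpoints the gap is widest on 8-needle (11.1 points lost for Kimi versus 34.0 for Gemini 3), while on 2-needle the two drops are close (14.8 versus 14.0). Two points per curve are only a rough proxy for a slope, and the 1M figures carry the caveat about incomplete runs noted earlier.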
A very different architectural approach yielding impressive stability at scale. Because of its current price point, it is very competitive for long context (MRCR).
Enjoy.
@Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs
It's the serialization followed by hashing. As I remember, even after optimization it still needed 45us. In this case, you can consider exporting the cubin after warming up and calling it directly.
why is triton's kernel launch cpu overhead so freaking high? the actual kernel takes 10x less time to execute than to launch and i can't use cuda graphs because the shapes are dynamic.
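The suggestion in the reply (export the cubin after warm-up, then call it directly) amounts to paying the serialize-and-hash lookup once instead of on every launch. The following is a toy pure-Python model of that idea only: `ToyJitLauncher`, `compile_scale`, and `resolve` are hypothetical names, and nothing here is Triton's actual API or launch path.

```python
import hashlib
import pickle


class ToyJitLauncher:
    """Toy stand-in for a JIT runtime that finds kernels via a hashed key."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # hypothetical: signature -> callable
        self._cache = {}
        self.key_computations = 0
        self.compilations = 0

    def _key(self, signature):
        # Serialize, then hash: the per-launch CPU cost described in the reply.
        self.key_computations += 1
        return hashlib.sha256(pickle.dumps(signature)).hexdigest()

    def resolve(self, signature):
        """Return the compiled callable, compiling it on first use."""
        key = self._key(signature)
        if key not in self._cache:
            self._cache[key] = self._compile(signature)
            self.compilations += 1
        return self._cache[key]

    def launch(self, signature, *args):
        # Slow path: the key is recomputed on every single call.
        return self.resolve(signature)(*args)


def compile_scale(signature):
    # Hypothetical "compiler": builds a kernel specialized on a constant.
    factor = signature["factor"]
    return lambda xs: [factor * x for x in xs]


launcher = ToyJitLauncher(compile_scale)
sig = {"name": "scale", "factor": 3}

launcher.launch(sig, [1, 2])         # warm-up: compiles and caches
fast_kernel = launcher.resolve(sig)  # "export" the compiled artifact once

# Hot loop: input sizes vary, but no serialization or hashing happens here.
results = [fast_kernel(list(range(n))) for n in (1, 2, 3)]
```

The hot loop calls the resolved kernel directly, so the key is computed only during warm-up and export. This works when the specialization key does not depend on the dynamic shapes, which is an assumption of the sketch.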
Hello, Kimi K2 Thinking!
The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200-300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window
Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling: scaling both thinking tokens and tool-calling turns.
K2 Thinking is now live on https://t.co/YutVbwktG0 in chat mode, with full agentic mode coming soon. It is also accessible via API.
API is live: https://t.co/EOZkbOwCN4
Tech blog: https://t.co/n7xxaszqzF
Weights & code: https://t.co/4ukcXB0iP6
@galuh1300d @deepseek_ai Undoubtedly, I respect and learn from their work. We compete in different aspects. A single flower does not make spring, but a garden full of flowers does.
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1.
https://t.co/mq5rkwchHk
#vLLM #PyTorch #OpenSourceAI #HybridModels
Many people are confused by Minimax's recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually appreciate Minimax's openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud.
That said, the "regrets" might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal. And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning. If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow.
Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don't want our agentic models to be forever bounded by context length - that's a limitation we'll have to overcome sooner or later.