Zhiyuan1i @uniartisan - Twitter Profile

Zhiyuan1i @uniartisan

about 2 months ago

@Yulun_Du The fastest KDA kernel in the world today 👀

Zhiyuan1i @uniartisan

about 2 months ago

❤️❤️❤️

Kimi.ai @Kimi_Moonshot

about 2 months ago

We're open-sourcing FlashKDA, our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. It achieves a 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20 and works as a drop-in backend for flash-linear-attention. Explore on GitHub: https://t.co/sf4UohXDWY

Zhiyuan1i @uniartisan

2 months ago

wowwww

Kai

@real_kai42

2 months ago

Kimi Code is now accepting applications for its early-access program. "Bro, no joke, the big one is really coming."

Zhiyuan1i @uniartisan

2 months ago

🚀🚀

Yu Zhang 🐙🌘

@yzhang_cs

2 months ago

flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. https://t.co/VirlvFzgYc

uniartisan retweeted

Kimi.ai @Kimi_Moonshot

3 months ago

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: https://t.co/u3EHICG05h

Zhiyuan1i @uniartisan

3 months ago

@im_datta0 @hu_yifei Please share a minimal script. FLA provides multiple ways to accelerate training; even Qwen3.5 itself uses FLA. In my opinion, the key is to avoid compilation and host-to-device/device-to-host (h2d/d2h) transfers.

uniartisan retweeted

Dillon Uzar

@DillonUzar

7 months ago

Context Arena Update: Added kimi-linear-48b-a3b-instruct [11-08] and kimi-k2 (Thinking) [11-06] to the MRCR leaderboards.

The Linear 48b results are fascinating! It actually outperforms the new Gemini 3.0 Pro Thinking on 4-needle and 8-needle tasks at higher context lengths (512k+). I've added it to 2needle, 4needle, and 8needle.

kimi-k2 (Thinking) lands lower on the leaderboards (Rank #22 for 2-needle AUC @ 128k), with a hard context ceiling around 262k. I did not run it for 2needle and 4needle.

All results at: https://t.co/gLEWzxoXWG

The performance curve for the Linear model is distinct: while it underperforms Gemini 3 significantly at shorter contexts (<=256k) on the difficult 8-needle test, its degradation slope is much flatter. Gemini starts higher and drops fast; Kimi starts lower but holds steady, overtaking Gemini at the higher end.

However, note that kimi-linear-48b has noticeable performance drops past 128k on the easier 2 & 4 needle tests. Additionally, due to lower token efficiency compared to Gemini/GPT, only ~60% of the 1M token tests successfully ran (hitting limits/OOM). So some caution with the results at the 1M level.

kimi-linear-48b results:

2-Needle Performance (@ 128k / @ 1M):
- AUC: 96.5% (vs Gem 3: 99.5%) / 81.7% (vs Gem 3: 85.5%)
- Pointwise: 96.0% (vs Gem 3: 99.0%) / 77.0% (vs Gem 3: 72.2%)

4-Needle Performance (@ 128k / @ 1M):
- AUC: 85.5% (vs 85.8%) / 62.7% (#1, beating Gem 3: 57.3%)
- Pointwise: 83.7% (vs 80.8%) / 51.5% (#1, beating Gem 3: 34.3%)

8-Needle Performance (@ 128k / @ 1M):
- AUC: 54.9% (vs 73.0%) / 43.8% (#1, beating Gem 3: 39.0%)
- Pointwise: 49.0% (vs 54.2%) / 35.3% (#1, beating Gem 3: 24.5%)

A very different architectural approach yielding impressive stability at scale. Because of its current price point, it is very competitive for long context (MRCR).

Enjoy.

@Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs

Zhiyuan1i @uniartisan

7 months ago

After haggling the price down, I realized I had already subscribed. Employee discount, please 🤡

黑 @Hx1u0

7 months ago

I've noticed you are all haggling champions; I'm the only one who can't haggle my own price down 🤡

Zhiyuan1i @uniartisan

7 months ago

Can't wait to see them

Lisan al Gaib

@scaling01

7 months ago

from Kimi AMA:
- K3 will likely use KDA or some other hybrid attention mechanism
- Kimi-K2 will get vision

Zhiyuan1i @uniartisan

7 months ago

It's the serialization and then hashing. As I remember, even after optimization it still took 45 µs. In this case, consider exporting the cubin after warm-up and calling the cubin directly.

maharshi

@maharshii

7 months ago

Why is Triton's kernel-launch CPU overhead so freaking high? The kernel itself runs in a tenth of the time it takes to launch it, and I can't use CUDA graphs because the shapes are dynamic.

Zhiyuan1i @uniartisan

7 months ago

KIMI of course 🙋‍♂️

Vishal

@Vixhal

7 months ago

Your current favorite LLM, and why?

Zhiyuan1i @uniartisan

7 months ago

Think deep. Work smart. Focus on the next six months, not the next ten years.

Zhiyuan1i @uniartisan

7 months ago

@aj_kourabi The speed issue has been resolved, and the final problem was surprising - everyone's enthusiasm exposed our network bandwidth issue.

Zhiyuan1i @uniartisan

7 months ago

@indra_himself You will enjoy Kimi Linear (48B). We built it for coding and for providing emotional value. It's a really warm and smart model.

Zhiyuan1i @uniartisan

7 months ago

I believe that we have the best algorithm and engineering teams, and what's important is that they work closely together like a family.

Kimi.ai @Kimi_Moonshot

7 months ago

🚀 Hello, Kimi K2 Thinking!
The Open-Source Thinking Agent Model is here.

🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window

Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling: scaling both thinking tokens and tool-calling turns.

K2 Thinking is now live on https://t.co/YutVbwktG0 in chat mode, with full agentic mode coming soon. It is also accessible via API.

🔌 API is live: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/n7xxaszqzF
🔗 Weights & code: https://t.co/4ukcXB0iP6

Zhiyuan1i @uniartisan

7 months ago

@galuh1300d @deepseek_ai Undoubtedly, I respect and learn from their work. We compete in different aspects. A single flower does not make spring, but a garden full of flowers does.

Zhiyuan1i @uniartisan

7 months ago

Cheers! I'm pleased to see several of our PRs featured here. This will boost the broader hybrid model universe!

PyTorch

@PyTorch

7 months ago

Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗 https://t.co/mq5rkwchHk #vLLM #PyTorch #OpenSourceAI #HybridModels

Zhiyuan1i @uniartisan

7 months ago

@HaveFunStayingP @teortaxesTex Each of us has our own direction, and we are extremely united and happy in our work.

uniartisan retweeted

Songlin Yang

@SonglinYang4

7 months ago

Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5).

I actually appreciate Minimax’s openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud.

That said, the “regrets” might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal.

And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning. If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow.

Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don’t want our agentic models to be forever bounded by context length - that’s a limitation we’ll have to overcome sooner or later.
