Mingxing Zhang

7 days ago

🚀 GLM-5.2 is here — and https://t.co/pP0HtsOYJu is Day 0 ready. 🧠 Stable 1M context 💻 Stronger coding & agent capabilities 📜 MIT-licensed weights ⚡ KTransformers now supports running GLM-5.2 token service on edge devices, powered by SGLang + KT-Kernel. 👉 Get started: https://t.co/UxI4Dt6JWA

0

11

2

597

11 days ago

@KVCache_AI Wondering if your KVCache hit ratio is optimal? Meet the new KVCache Analyzer from @KVCache_AI !

0

2

0

55

james0zan retweeted

Googler/Former ACMER in BUPT/Fond of Programming competitions/Striving for a life outside GFW/Seeking immigration outside of mainland China and North Korea

12 days ago

🚀 We just launched KV Cache Analyzer by https://t.co/EO7MXLjRIs! 📊 Analyze KV cache hit rates and estimate prefill throughput speedup under different cache budgets and eviction policies. 🧪 Use preset traces or your own local traces, choose the model and parameters, and see how KV cache reuse improves LLM inference performance across different settings. 👉 Try it now: https://t.co/3dccW3r657
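
For readers unfamiliar with the idea behind the post above, here is a minimal sketch of how a KV cache hit rate can be estimated for one cache budget and one eviction policy. It is not the KV Cache Analyzer's implementation. The function name `simulate_prefix_hit_rate`, the block size of 16 tokens, and the choice of LRU eviction are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_prefix_hit_rate(requests, budget_blocks, block_size=16):
    """Replay token-id sequences through an LRU cache of prefix blocks.

    Hypothetical helper: returns the fraction of full blocks whose KV
    entries could be reused from the cache instead of being recomputed.
    """
    cache = OrderedDict()  # key: the whole token prefix ending at this block
    hits = 0
    total = 0
    for tokens in requests:
        prefix_ok = True  # a block is reusable only if all earlier blocks hit
        for i in range(len(tokens) // block_size):
            # Keying on the full prefix is simple but memory-hungry;
            # it is good enough for a small illustrative trace.
            key = tuple(tokens[: (i + 1) * block_size])
            total += 1
            if prefix_ok and key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
            else:
                prefix_ok = False
                cache[key] = True
                cache.move_to_end(key)
                if len(cache) > budget_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Two identical 32-token requests: the second one reuses both blocks
# when the budget is large enough, and none when the budget is too small.
trace = [list(range(32)), list(range(32))]
print(simulate_prefix_hit_rate(trace, budget_blocks=4))
print(simulate_prefix_hit_rate(trace, budget_blocks=1))
```

Sweeping `budget_blocks` over a real trace gives a hit-rate curve per policy, which is the kind of comparison the post describes.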

1

17

6

13

858

james0zan retweeted

Matej Sirovatka

@m_sirovatka

22 days ago

KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀

14

337

25

137

31K

Minmin Sun @MinminSun2019

25 days ago

@KVCache_AI Accurately quantifying your KV cache footprint is typically step one for inference optimization. Thanks for the incredible support! We've officially updated our KVCache calculator with more models.

0

5

0

141

james0zan retweeted

28 days ago

Big congrats to the TokenSpeed team & Qwen Inference team! 🙌 This is just chapter one. We’ll keep co-engineering to unlock speed-of-light inference for every Qwen model.

0

10

2

1

2K

james0zan retweeted

Kyle Kranen

@KranenKyle

27 days ago

Cold starts are super painful for scaling LLM workers. Check out our work at restoring inference workers (including AOT traces) in seconds, not 10s of minutes!

1

54

4

16

6K

27 days ago

@KranenKyle Great Job. Looking forward to the future multi-GPU related driver optimizations!

1

11

2

0

969

james0zan retweeted

27 days ago

Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀 Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs. From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference. Full PyTorch blog 👇 https://t.co/jDW0lNsUPd

1

14

4

3

1K

james0zan retweeted

Qwen

@Alibaba_Qwen

28 days ago

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 https://t.co/p04wookcZj #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

39

1K

92

297

594K

james0zan retweeted

PyTorch

@PyTorch

28 days ago

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

12

289

50

157

279K

james0zan retweeted

about 1 month ago

🚀 We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C! Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns. Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax. Try it now: https://t.co/PpN0bvKSVI
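
As background for the post above, the sketch below shows the standard KV cache size formula for a transformer with multi-head or grouped-query attention. It is not the calculator's own code. The function name `kv_cache_bytes` and the example configuration are hypothetical, and models that compress the cache (for example with latent attention) need a different formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size in bytes for standard MHA/GQA attention.

    The leading factor of 2 covers the key and the value tensor.
    bytes_per_elem is 2 for FP16/BF16 and 1 for FP8 or INT8.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, 4096 tokens, FP16.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{size / 2**20:.0f} MiB")  # prints "512 MiB"
```

Halving `bytes_per_elem` halves the footprint, which is why the precision setting matters when sizing a cache.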

9

139

19

111

47K

james0zan retweeted

about 1 month ago

@ApproachingAI @AIatAMD It is a privilege to share Mooncake's recent updates on AMD Developer Day

0

1

0

85

james0zan retweeted

RadixArk

@radixark

about 2 months ago

$200 FREE CREDIT! We just launched our inference platform for beta testing, and we're giving it to the community first. ⭐ Star SGLang on GitHub (https://t.co/uEeiF4ANRf) + repost this to claim your credits. → Limited spots, first come first serve → Deadline: May 13, 2025 (AoE) Every star, every issue filed, every PR reviewed, every question answered in Slack — You built this with us. Thank you for believing in open-source AI infrastructure, in our mission, and in us. Claim your credits: https://t.co/MVDvcvkFGX

36

344

255

185

83K

james0zan retweeted

about 2 months ago

🚀 Mooncake is proud to support TokenSpeed, a new “speed-of-light” inference engine for agentic workloads!

0

10

3

2K

james0zan retweeted

about 2 months ago

🚀 Mooncake is powering agentic workloads serving with @vllm_project Agentic traces reach 80K+ tokens with highly reusable prefixes. By turning KV cache into a distributed, reusable resource, we eliminate redundant compute and unlock massive gains: 🚀 3.8x higher throughput, ⚡ 46x lower P50 TTFT, 🌐Scales near-linearly to 60 GB200 GPUs at >95% hit rate. Built in close collaboration with @Inferact 🤝

0

8

2

1

569