KVCache.AI

Verified account

@KVCache_AI

Hi, this is official account. We build systems for efficient LLM serving, including KTransformers and Mooncake.

Beijing

Joined August 2018

99 Following

337 Followers

38 Posts

5 days ago

🚀 KV Cache Size Calculator update! Thanks to the amazing support from the open-source community, our tool has been widely used and shared. Over the past week, we’ve been adding support for more LLM model families: ✅ DeepSeek V3/R1 ✅ MiMo V2.5 ✅ Qwen3.5 & Qwen3.6 ✅ Cohere ✅ Gemma ✅ Llama Estimate KV cache size with flexible precision settings, transparent formulas, and detailed breakdowns. Try it here: https://t.co/t5PoL9XGl8

KVCache_AI's tweet photo. 🚀 KV Cache Size Calculator update!
Thanks to the amazing support from the open-source community, our tool has been widely used and shared.
Over the past week, we’ve been adding support for more LLM model families:
✅ DeepSeek V3/R1
✅ MiMo V2.5
✅ Qwen3.5 & Qwen3.6
✅ Cohere
✅ Gemma
✅ Llama
Estimate KV cache size with flexible precision settings, transparent formulas, and detailed breakdowns.
Try it here: https://t.co/t5PoL9XGl8

2

14

3

19

985

7 days ago

Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀 Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs. From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference. Full PyTorch blog 👇 https://t.co/jDW0lNsUPd

7 days ago

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

PyTorch's tweet photo. The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs.

In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok

This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

12

287

49

157

272K

1

14

4

3

1K

11 days ago

@ursusspeculus Of course! Cohere Command series are now supported. Feel free to give it a try.

0

2

0

0

47

13 days ago

🚀 We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C! Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns. Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax. Try it now: https://t.co/PpN0bvKSVI

KVCache_AI's tweet photo. 🚀 We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C!

Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns.

Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax.

Try it now: https://t.co/PpN0bvKSVI https://t.co/dUYu67vhrf

9

137

19

110

47K

11 days ago

@zaafonin Thanks for the feedback! Qwen 3.5/3.6 and Gemma 4 series are now supported. Feel free to give it a try.

0

2

0

0

72

11 days ago

@PurplefinNep Thanks for the feedback! Qwen 3.5 is now supported. Feel free to give it a try.

0

1

0

0

63

12 days ago

Thanks so much for using the KV cache size calculator and for all the great suggestions! We’ve seen the requests for more models. We’ll do our best to add support as soon as possible. Really appreciate all the feedback!

0

2

0

0

230

28 days ago

🚀 Mooncake is proud to support TokenSpeed, a new “speed-of-light” inference engine for agentic workloads!

LightSeek Foundation

28 days ago

Introducing TokenSpeed, a speed-of-light LLM inference engine. > TensorRT LLM level performance > vLLM level usability > Built by a lean and mission-driven team in two months > MIT license, open-source https://t.co/MJzhCEg7m8 https://t.co/anhoETwwS9

lightseekorg's tweet photo. Introducing TokenSpeed, a speed-of-light LLM inference engine.

> TensorRT LLM level performance

> vLLM level usability

> Built by a lean and mission-driven team in two months

> MIT license, open-source

https://t.co/MJzhCEg7m8

https://t.co/anhoETwwS9 https://t.co/BWn4Me62x7

44

1K

127

926

2M

0

9

3

3

2K

28 days ago

🚀 Mooncake is powering agentic workloads serving with @vllm_project Agentic traces reach 80K+ tokens with highly reusable prefixes. By turning KV cache into a distributed, reusable resource, we eliminate redundant compute and unlock massive gains: 🚀 3.8x higher throughput, ⚡ 46x lower P50 TTFT, 🌐Scales near-linearly to 60 GB200 GPUs at >95% hit rate. Built in close collaboration with @Inferact 🤝

28 days ago

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake. Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets: 🚀 3.8x higher throughput ⚡ 46x lower P50 TTFT ⏱️ 8.6x lower E2E latency 📈 Cache hit rate 1.7% -> 92.2% 🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate 🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI 📖 Read more: https://t.co/XIRtQ9pYVQ 🧵👇

5

228

42

149

55K

0

8

2

1

547

about 2 months ago

Huge milestone for kimi-k2.5-eagle3 reaching 40K downloads on Hugging Face, especially in just two weeks 🚀🚀🚀 It is also a great signal for the growing adoption of speculative decoding in production.

LightSeek Foundation

about 2 months ago

🚀TorchSpec has been live for 2 weeks — and kimi-k2.5-eagle3 just hit 40K downloads on HuggingFace! Thanks to @KT_Project_AI Team and @vllm_project Team for the amazing collaboration. Links in comments.

lightseekorg's tweet photo. 🚀TorchSpec has been live for 2 weeks — and kimi-k2.5-eagle3 just hit 40K downloads on HuggingFace!

Thanks to @KT_Project_AI Team and @vllm_project Team for the amazing collaboration.

Links in comments. https://t.co/2L4xL100L2

2

41

9

8

1M

1

8

2

1

1K

2 months ago

One of the biggest challenges with large-scale EP deployments is the expanding blast radius. Fault tolerance and recovery capabilities are critical for supporting truly large-scale EP, and they are also among the most difficult parts to implement. To address this, the Mooncake and SGLang teams jointly developed Elastic EP. If you’re interested in EP deployments, feel free to give it a try! Details: https://t.co/779HlboIlj

0

4

1

0

219

2 months ago

We’re excited to share our experience in improving the user experience of OpenClaw. By leveraging SGLang HiCache and Mooncake, we not only reduced fast-path latency, but also significantly improved TTFT tail latency. 🔗 Read our latest blog for more details: https://t.co/L1tjVpXXUx

0

2

0

0

140

3 months ago

Great work! Scalable speculative decoding training is an important step forward as models continue to grow in size and context length. Excited to see Mooncake play a key role here by providing efficient and reliable streaming of hidden states, making fully disaggregated inference and training pipelines practical.

3 months ago

We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams. By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated pipelines where inference and training scale independently. 🔗 Read our latest blog from TorchSpec & Mooncake teams: https://t.co/XHF16zD9F9 @lightseekorg @KT_Project_AI #PyTorch #TorchSpec #Mooncake #OpenSourceAI

PyTorch's tweet photo. We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams.

By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated pipelines where inference and training scale independently.

🔗 Read our latest blog from TorchSpec & Mooncake teams: https://t.co/XHF16zD9F9

@lightseekorg @KT_Project_AI

#PyTorch #TorchSpec #Mooncake #OpenSourceAI

1

116

24

47

20K

0

5

3

0

753

3 months ago

Huge congratulations to the @lmsysorg SGLang team and @nvidia on these impressive GB300 results! 🚀 Powerful hardware + excellent software optimization is exactly how you unlock the full potential of long-context inference. Glad that Mooncake, as the KV cache transfer component, could contribute to this milestone. Excited to see what’s next!

3 months ago

🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference! Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving: ⚡ 226 TPS/GPU peak throughput (1.53X vs GB200) 🧠 1.87X TPS/User gain with MTP under matched throughput 💾 1.6X higher decode batch size via GB300's 288GB HBM3e ⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP 🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

lmsysorg's tweet photo. 🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

lmsysorg's tweet photo. 🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

lmsysorg's tweet photo. 🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

lmsysorg's tweet photo. 🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

7

99

24

34

31K

1

7

1

2

444

4 months ago

⚡ Day-0 support for Qwen3.5-397B-A17B just landed in KTransformers! This beast features Gated Delta Networks + sparse MoE (397B total, 17B active), unified vision-language, and 262K native context. Ready to run on your local machine.

4 months ago

🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series. 🖼️Native multimodal. Trained for real-world agents. ✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling. ⚡8.6x–19.0x decoding throughput vs Qwen3-Max 🌍201 languages & dialects 📜Apache2.0 licensed 🔗Dive in: GitHub: https://t.co/NzNdS9joAT Chat: https://t.co/bg4tAU0Rhw API：https://t.co/YiiyKTnHoU Qwen Code: https://t.co/qqwj5nAger Hugging Face: https://t.co/wFMdX5p5um ModelScope: https://t.co/9NGXcId57a blog: https://t.co/AW8UQStXaL

Alibaba_Qwen's tweet photo. 🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series.

🖼️Native multimodal. Trained for real-world agents.
✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling.
⚡8.6x–19.0x decoding throughput vs Qwen3-Max
🌍201 languages & dialects
📜Apache2.0 licensed

🔗Dive in:
GitHub: https://t.co/NzNdS9joAT
Chat: https://t.co/bg4tAU0Rhw
API：https://t.co/YiiyKTnHoU
Qwen Code: https://t.co/qqwj5nAger
Hugging Face: https://t.co/wFMdX5p5um
ModelScope: https://t.co/9NGXcId57a
blog: https://t.co/AW8UQStXaL

271

5K

860

1K

1M

0

20

4

2

20K

4 months ago

IMPRESSIVE! with such a size of 200~B parameters and 10B activation!

MiniMax (official) @MiniMax_AI

4 months ago

Introducing M2.5, an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution, 37% faster at complex tasks. - At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible MiniMax Agent: https://t.co/aIzrFYcfUz API: https://t.co/fHRdSV7BwZ CodingPlan: https://t.co/FDhZBBjQrX

MiniMax_AI's tweet photo. Introducing M2.5, an open-source frontier model designed for real-world productivity.

- SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.

- Optimized for efficient execution, 37% faster at complex tasks.

- At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible

MiniMax Agent: https://t.co/aIzrFYcfUz
API: https://t.co/fHRdSV7BwZ
CodingPlan: https://t.co/FDhZBBjQrX

454

9K

1K

8K

5M

0

3

0

1

218

4 months ago

Huge congrats to Minimax, this awesome new model is now open-source! KTransformers is happy to provided day0 support for M2.5. You can use KTransformers to enjoy the cutting edge ability of M2.5 with only 1 5090 + 300GB DRAM!

MiniMax (official) @MiniMax_AI

4 months ago

MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows. Hugging Face: https://t.co/zfu7Am7yOg GitHub: https://t.co/uF3FNnb5AX Coding Plan: https://t.co/FDhZBBjQrX Intelligence with Everyone

97

2K

240

675

1M

1

5

1

1

258

4 months ago

@aiktp_com @PyTorch Thanks for your interest and support! Regarding PD disaggregation performance, we have seen very promising improvements. Here are some benchmark results for reference: https://t.co/OQAnJDYWdy https://t.co/tCtCCVwb09

0

0

0

1

42

Last Seen Users on Sotwe

Trends for you

Most Popular Users