🚀 KV Cache Size Calculator update!
Thanks to the amazing support from the open-source community, our tool has been widely used and shared.
Over the past week, we’ve been adding support for more LLM model families:
✅ DeepSeek V3/R1
✅ MiMo V2.5
✅ Qwen3.5 & Qwen3.6
✅ Cohere
✅ Gemma
✅ Llama
Estimate KV cache size with flexible precision settings, transparent formulas, and detailed breakdowns.
Try it here: https://t.co/t5PoL9XGl8
Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀
Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs.
From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference.
Full PyTorch blog 👇
https://t.co/jDW0lNsUPd
The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs.
In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok
This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI
🚀 We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C!
Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns.
Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax.
Try it now: https://t.co/PpN0bvKSVI
Thanks so much for using the KV cache size calculator and for all the great suggestions! We’ve seen the requests for more models. We’ll do our best to add support as soon as possible. Really appreciate all the feedback!
Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
https://t.co/MJzhCEg7m8
https://t.co/anhoETwwS9
🚀 Mooncake is powering agentic workloads serving with @vllm_project
Agentic traces reach 80K+ tokens with highly reusable prefixes. By turning KV cache into a distributed, reusable resource, we eliminate redundant compute and unlock massive gains: 🚀 3.8x higher throughput, ⚡ 46x lower P50 TTFT, 🌐Scales near-linearly to 60 GB200 GPUs at >95% hit rate.
Built in close collaboration with @Inferact 🤝
🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake.
Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them.
By integrating Mooncake Store as a distributed KV cache pool, vLLM gets:
🚀 3.8x higher throughput
⚡ 46x lower P50 TTFT
⏱️ 8.6x lower E2E latency
📈 Cache hit rate 1.7% -> 92.2%
🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate
🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI
📖 Read more: https://t.co/XIRtQ9pYVQ
🧵👇
Huge milestone for kimi-k2.5-eagle3 reaching 40K downloads on Hugging Face, especially in just two weeks 🚀🚀🚀
It is also a great signal for the growing adoption of speculative decoding in production.
🚀TorchSpec has been live for 2 weeks — and kimi-k2.5-eagle3 just hit 40K downloads on HuggingFace!
Thanks to @KT_Project_AI Team and @vllm_project Team for the amazing collaboration.
Links in comments.
One of the biggest challenges with large-scale EP deployments is the expanding blast radius. Fault tolerance and recovery capabilities are critical for supporting truly large-scale EP, and they are also among the most difficult parts to implement. To address this, the Mooncake and SGLang teams jointly developed Elastic EP. If you’re interested in EP deployments, feel free to give it a try!
Details: https://t.co/779HlboIlj
We’re excited to share our experience in improving the user experience of OpenClaw. By leveraging SGLang HiCache and Mooncake, we not only reduced fast-path latency, but also significantly improved TTFT tail latency.
🔗 Read our latest blog for more details: https://t.co/L1tjVpXXUx
Great work! Scalable speculative decoding training is an important step forward as models continue to grow in size and context length.
Excited to see Mooncake play a key role here by providing efficient and reliable streaming of hidden states, making fully disaggregated inference and training pipelines practical.
We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams.
By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated pipelines where inference and training scale independently.
🔗 Read our latest blog from TorchSpec & Mooncake teams: https://t.co/XHF16zD9F9
@lightseekorg @KT_Project_AI
#PyTorch #TorchSpec #Mooncake #OpenSourceAI
Huge congratulations to the @lmsysorg SGLang team and @nvidia on these impressive GB300 results! 🚀
Powerful hardware + excellent software optimization is exactly how you unlock the full potential of long-context inference.
Glad that Mooncake, as the KV cache transfer component, could contribute to this milestone. Excited to see what’s next!
🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!
Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra
Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev
⚡ Day-0 support for Qwen3.5-397B-A17B just landed in KTransformers! This beast features Gated Delta Networks + sparse MoE (397B total, 17B active), unified vision-language, and 262K native context. Ready to run on your local machine.
🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series.
🖼️Native multimodal. Trained for real-world agents.
✨Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling.
⚡8.6x–19.0x decoding throughput vs Qwen3-Max
🌍201 languages & dialects
📜Apache2.0 licensed
🔗Dive in:
GitHub: https://t.co/NzNdS9joAT
Chat: https://t.co/bg4tAU0Rhw
API:https://t.co/YiiyKTnHoU
Qwen Code: https://t.co/qqwj5nAger
Hugging Face: https://t.co/wFMdX5p5um
ModelScope: https://t.co/9NGXcId57a
blog: https://t.co/AW8UQStXaL
Introducing M2.5, an open-source frontier model designed for real-world productivity.
- SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.
- Optimized for efficient execution, 37% faster at complex tasks.
- At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible
MiniMax Agent: https://t.co/aIzrFYcfUz
API: https://t.co/fHRdSV7BwZ
CodingPlan: https://t.co/FDhZBBjQrX
Huge congrats to Minimax, this awesome new model is now open-source! KTransformers is happy to provided day0 support for M2.5. You can use KTransformers to enjoy the cutting edge ability of M2.5 with only 1 5090 + 300GB DRAM!
MiniMax-M2.5 is now open source.
Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows.
Hugging Face: https://t.co/zfu7Am7yOg
GitHub: https://t.co/uF3FNnb5AX
Coding Plan: https://t.co/FDhZBBjQrX
Intelligence with Everyone
@aiktp_com@PyTorch Thanks for your interest and support! Regarding PD disaggregation performance, we have seen very promising improvements. Here are some benchmark results for reference:
https://t.co/OQAnJDYWdy
https://t.co/tCtCCVwb09