vLLM Community Meetup, May 23
https://t.co/lSgpkevhhO joined the vLLM @vllm_project community meetup themed "How will inference acceleration frameworks reshape LLM deployment in the Agentic AI era?", alongside Red Hat, Inferact, Alibaba, NVIDIA, Moonshot AI, and the open-source community.
Lu Jiahao (https://t.co/lSgpkevhhO tech expert, Mooncake core contributor) co-presented "llm-d + Mooncake: Large-Scale Agentic AI Inference in Practice" with Greg Pereira, core maintainer of llm-d.
---
🔹 The Agentic AI bottleneck
As context lengths grow, traditional compute paradigms face new cost and performance pressure. But multi-turn interactions, tool calls and long-context reuse give Agentic workloads very high KV cache hit rates, opening a window for "storage-for-compute" trade-offs.
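To make the hit-rate claim concrete, here is a minimal sketch of how much of a multi-turn session's prompt is already-seen prefix. The function name and the turn sizes are hypothetical, and it assumes the best case where every earlier token is still cached:

```python
def prefix_reuse(turn_lengths):
    """Return (reused_tokens, total_prompt_tokens) for one multi-turn session.

    turn_lengths[i] is the number of NEW tokens appended at turn i
    (user message, tool output, previous reply). Best-case assumption:
    every token from earlier turns is still in the KV cache.
    """
    context = 0   # tokens accumulated before the current turn
    reused = 0    # prompt tokens whose KV entries could be served from cache
    total = 0     # prompt tokens processed if nothing were cached
    for new_tokens in turn_lengths:
        reused += context
        total += context + new_tokens
        context += new_tokens
    return reused, total


# Hypothetical session: a 2,000-token first prompt, then three 500-token turns.
reused, total = prefix_reuse([2000, 500, 500, 500])
print(f"cacheable share of prompt tokens: {reused / total:.0%}")
```

With these made-up numbers, roughly two thirds of all prompt tokens are repeated prefix, and the share keeps rising as the session gets longer.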
🔹 Mooncake: store more, transfer fast
Transfer Engine (TENT) - evolves from a passive communication library into an active orchestrator. Declarative orchestration replaces imperative APIs, delivering low-latency, high-throughput transfer across heterogeneous interconnects.
Mooncake Store - a high-capacity, low-cost SSD tier joins the cache hierarchy. TCO drops, performance holds, and the KV cache pool scales beyond single-machine limits for cross-instance, cross-node context reuse.
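The idea of adding a slower, larger tier under a fast one can be sketched as a toy two-level LRU cache. This is not Mooncake Store's API or design; the class name `TieredKVCache`, the slot counts, and the demote/promote policy are all assumptions chosen for illustration:

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small fast tier ("dram") over a larger slow tier ("ssd").

    Hypothetical illustration only. Blocks evicted from the fast tier are
    demoted to the slow tier instead of being dropped, and a hit in the
    slow tier promotes the block back to the fast tier.
    """

    def __init__(self, dram_slots, ssd_slots):
        self.dram = OrderedDict()  # least recently used entry first
        self.ssd = OrderedDict()
        self.dram_slots = dram_slots
        self.ssd_slots = ssd_slots

    def put(self, key, block):
        self.ssd.pop(key, None)          # keep a key in at most one tier
        self.dram[key] = block
        self.dram.move_to_end(key)       # mark as most recently used
        while len(self.dram) > self.dram_slots:
            old_key, old_block = self.dram.popitem(last=False)
            self.ssd[old_key] = old_block            # demote, do not discard
            self.ssd.move_to_end(old_key)
            while len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)         # only now is data lost

    def get(self, key):
        """Return (block, tier), where tier is 'dram', 'ssd', or 'miss'."""
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key], "dram"
        if key in self.ssd:
            block = self.ssd.pop(key)
            self.put(key, block)                     # promote on hit
            return block, "ssd"
        return None, "miss"


cache = TieredKVCache(dram_slots=2, ssd_slots=2)
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
print(cache.get("a"))  # "a" was demoted, so this hit is served by the slow tier
```

The point of the sketch: a hit in the slow tier costs a read, while a miss costs a full prefill recompute, which is the "storage-for-compute" trade.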
🔹 llm-d × Mooncake roadmap
Phase 1 - MooncakeConnector: PD disaggregation, P2P transfer, Messages API
Phase 2 - MooncakeStoreConnector: tiered cache offload, Responses API
---
https://t.co/lSgpkevhhO will keep contributing to Mooncake and partner with the community to push inference systems toward a new balance of scale, performance and cost for the Agentic AI era.
We just launched the open-source KV Cache Size Calculator by https://t.co/GavTTEDu5C!
Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns.
Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax.
Try it now: https://t.co/PpN0bvKSVI
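For a rough feel of the numbers, here is a minimal sketch of the textbook KV cache formula for standard multi-head or grouped-query attention. It is not the calculator's implementation, and the model shape in the example is hypothetical. It also does not cover architectures that cache a compressed latent instead of per-head keys and values, such as the multi-head latent attention used in recent DeepSeek models:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size in bytes for standard MHA/GQA attention.

    The leading factor of 2 covers keys and values. bytes_per_elem is the
    precision setting: 2 for FP16/BF16, 1 for FP8/INT8.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical GQA model: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{full_ctx / 2**30:.1f} GiB at 8,192 tokens")
```

Halving `bytes_per_elem` halves the footprint, which is why the precision setting matters as much as context length.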
At AMD Developer Day @AIatAMD
Tsinghua University Assoc. Prof. and https://t.co/lSgpkevhhO co-initiator Mingxing Zhang @james0zan gave a talk:
"How Do We Reduce Token Costs in the Agent Era?"
---
The core shift in AI workloads:
We're moving from simple chatbots to complex agent applications.
Each agent task requires dozens of model calls and heavy context I/O, pushing token consumption from 1x to 1000x.
Not all tokens share the same SLO, and compute infrastructure does not equal efficient token production.
Cache, transfer, and token reuse matter as much as the silicon underneath.
---
This thinking shapes Mooncake
https://t.co/jfLpfkN40F
The open-source KV-cache-centric serving substrate for modern LLMs.
Originally focused on KV cache (store more, transfer fast, easy to use), Mooncake has grown into a common substrate for disaggregated LLM serving:
- PD / EPD / RL disaggregation
- Large-scale elastic EP
- Efficient model weight loading
---
Latest Mooncake highlights from the talk:
⚡ Checkpoint Engine - 1T parameters synced in seconds for distributed RL
⚡ Elastic EP in SGLang - partial failure tolerance for DeepSeek MoE deployments
⚡ vLLM native integration - scaling agentic workloads in production
⚡ Omni Models
⚡ TorchSpec - decoupled speculative decoding framework, streaming hidden states from inference to training
⚡ RL toolchains - TransferQueue / Slime (WIP) / Roll (WIP) for cross-device async non-blocking scheduling
⚡ Production-proven - large-scale deployment validated by RBG + SGLang + Mooncake
---
Thanks @AIatAMD for hosting.
We look forward to deepening collaboration with the AMD developer community on the infrastructure layer of the agent economy.
The goal: let developers work with frontier LLMs on the hardware they already have. Local inference → local fine-tuning. Model adaptation → multi-platform. KTransformers is building more open local LLM infrastructure.
At GOSIM Paris 2026 @gosimfoundation, https://t.co/TNZHDOONnW's Chief Engine Architect Ervin Xie shared what's next for KTransformers: local fine-tuning via KT-SFT (run + tune on one workstation), and broader platform support, including Windows.
5/ Deploy in 3 steps: download weights from HuggingFace → launch SGLang server with KT-Kernel → interact via KT CLI or API. Full guide: doc/en/DeepSeek-V4-Flash.md
1/ KTransformers now supports DeepSeek-V4 on consumer hardware. Single RTX 5090 + RAM → V4-Flash at 20 tok/s. Two RTX 5090s → V4-Pro. First time a model at this scale runs end-to-end on consumer GPUs.
4/ New operators: MX_FP4_MOE_TP on CPU (native E2M1 quant + ue8m0 scaling, group_size=32). SM_120 portable MXFP4 MoE kernel on GPU for Blackwell. Full path from quantization format to hardware execution.
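For readers new to the format, here is a minimal pure-Python sketch of MXFP4-style block quantization: 4-bit E2M1 elements sharing one power-of-two E8M0 scale per group of 32. It follows the general shape of the OCP Microscaling format and is not the KT-Kernel operator. The function names are hypothetical, and the tie-breaking and all-zero handling are simplifications:

```python
import math

GROUP_SIZE = 32
# Magnitudes representable by E2M1; the list index equals the 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # largest E2M1 value is 6.0 = 1.5 * 2**2
E8M0_BIAS = 127    # the shared scale is stored as a biased 8-bit exponent


def quantize_block(values):
    """Quantize one group into (biased_scale_exponent, list of 4-bit codes)."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        shared_exp = 0  # any scale works for an all-zero block (simplification)
    else:
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
        shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
        shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])   # clamp to the largest value
        # Nearest representable magnitude; ties go to the smaller one here.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((8 if v < 0 else 0) | idx)      # bit 3 is the sign
    return shared_exp + E8M0_BIAS, codes


def dequantize_block(biased_exp, codes):
    """Reconstruct floats from the shared scale and the 4-bit codes."""
    scale = 2.0 ** (biased_exp - E8M0_BIAS)
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * scale for c in codes]


block = [12.0, -4.0, 1.1] + [0.0] * (GROUP_SIZE - 3)
exp_byte, codes = quantize_block(block)
print(exp_byte, codes[:3], dequantize_block(exp_byte, codes)[:3])
```

Each group costs 32 × 4 bits plus one 8-bit scale, about 4.25 bits per weight, with the power-of-two scale keeping dequantization to a cheap exponent shift.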
Proud to be a launch partner for Kimi K2.6!
We shipped full support on Day 0!
- Token service live!
- KTransformers adapted: full-precision inference & fine-tuning with just 48GB VRAM + sufficient RAM
Try it now: https://t.co/D6YauRMrIW
Meet Kimi K2.6: Advancing Open-Source Coding
🔹 Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python (86.7), Math Vision w/ python (93.2)
What's new:
🔹 Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹 Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹 Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹 Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops.
🔹 Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop.
-
K2.6 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
For production-grade coding, pair K2.6 with Kimi Code: https://t.co/uvoSJKyGCY
-
API: https://t.co/EOZkbOwCN4
Tech blog: https://t.co/9wWvgIQSS3
Weights & code: https://t.co/Be0hjs2RTP
Big update from https://t.co/1v5Wy3AVgv:
Dr. Jenny Wu joins as President, bringing experience from Baidu Capital, https://t.co/1l61Wllrja, Morgan Stanley, and New Hope.
She'll lead strategy, finance, and global ops as we scale efficient AI inference.
More to come.
#Inference