🔧 New in LMCache MP Server: /run_script Admin Endpoint
One of our core maintainers, maobaolong, introduced /run_script, a new admin endpoint for live debugging and tuning in the LMCache MP server.
With /run_script, developers can inspect runtime state, adjust read/write TTLs, query L1 memory usage, and check server status, all without restarting or redeploying the server.
Because a script can access attributes through app.state.engine, changes such as TTL updates are re-read by the running code and take effect on the next read/write operation.
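To make the "takes effect on the next operation" point concrete, here is a minimal sketch, not LMCache's actual code. The `Engine` class, its `read_ttl_s` attribute, and the `run_script` helper are hypothetical stand-ins; the only detail taken from the post is that a script reaches state through app.state.engine and the running code re-reads the attribute.

```python
from types import SimpleNamespace
from typing import Dict, Optional, Tuple


class Engine:
    """Hypothetical stand-in for the object exposed as app.state.engine."""

    def __init__(self, read_ttl_s: float) -> None:
        self.read_ttl_s = read_ttl_s
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes, now: float) -> None:
        self._store[key] = (now, value)

    def get(self, key: str, now: float) -> Optional[bytes]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # The TTL is re-read from the attribute on every call, so an update
        # made by an admin script applies to the very next read.
        if now - stored_at > self.read_ttl_s:
            return None
        return value


def run_script(app, source: str) -> dict:
    """Hypothetical handler: run admin-supplied code with `app` in scope."""
    scope = {"app": app, "result": None}
    exec(source, scope)  # arbitrary code execution: admin-only by design
    return {"result": scope["result"]}


app = SimpleNamespace(state=SimpleNamespace(engine=Engine(read_ttl_s=60.0)))
app.state.engine.put("chunk-0", b"kv", now=0.0)

hit_before = app.state.engine.get("chunk-0", now=30.0)
reply = run_script(
    app,
    "app.state.engine.read_ttl_s = 10.0\nresult = app.state.engine.read_ttl_s",
)
hit_after = app.state.engine.get("chunk-0", now=30.0)
```

The same entry is readable before the script runs and expired afterwards, with no restart in between. An endpoint like this executes arbitrary code, so in practice it should only be reachable by administrators.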
Read the full beginner-friendly tutorial and implementation details here: https://t.co/z8siE04XQP
KV cache is becoming an independent AI-native data layer, shared across requests, clusters, and serving systems.
LMCache is proud to help push this frontier forward as an open-source community.
As this space continues to evolve and gain momentum, a new chapter begins for LMCache and the broader KV cache community.
Read more: https://t.co/8wlClx0esv
Stay Connected with LMCache:
  • Follow us on LinkedIn: https://t.co/HDqzkXL2CR
  • Join our Slack community: https://t.co/gt1SrXsnV7
  • Follow our WeChat Official Account: https://t.co/zSesWi5ulv
#AI #inference #LMCache #KVCache
Dynamo with LMCache MP mode
We've updated the Dynamo integration to support LMCache's new multiprocess (MP) mode, complete with ready-to-run startup scripts. If you're serving with Dynamo, there's now a launch path for running LMCache as an out-of-process sidecar alongside the vLLM backend. Dynamo connects to the sidecar through LMCacheMPConnector, bringing the integration in line with LMCache's newer multiprocess architecture.
Huge thanks to @shaoting_feng for making this possible! Up next: disaggregated serving support for MP mode in Dynamo. Stay tuned!
Explore more: https://t.co/9eoridwMSU
#AI #inference #LMCache #KVCache
New in LMCache: L2 Adapter Benchmark CLI.
You can now benchmark the throughput of an L2 cache adapter's base operations (store / lookup / load) directly, without starting an inference engine or an LMCache MP server.
The command only requires the adapter's backing storage to be reachable, making it easier to test and compare L2 backends before plugging them into a full serving workflow.
Try it with the L2 backend that best fits your workflow, whether that's local filesystem, Redis, S3, or any other adapter.
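For a feel of what such a benchmark measures, here is a rough sketch that times the three base operations against a toy adapter. It is not the LMCache CLI: `InMemoryAdapter`, `ops_per_second`, and `benchmark` are hypothetical names, and a real run would point at a filesystem, Redis, or S3 backend instead of a dict.

```python
import time
from typing import Callable, Dict, Optional


class InMemoryAdapter:
    """Hypothetical L2 adapter exposing the three base operations."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def lookup(self, key: str) -> bool:
        return key in self._data

    def load(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def ops_per_second(op: Callable[[int], object], n: int) -> float:
    """Run op(i) for i in range(n) and return operations per second."""
    start = time.perf_counter()
    for i in range(n):
        op(i)
    elapsed = time.perf_counter() - start
    return n / max(elapsed, 1e-9)  # guard against a zero-length interval


def benchmark(adapter: InMemoryAdapter, n: int = 1000, size: int = 4096) -> Dict[str, float]:
    """Measure store, then lookup, then load, on the same n keys."""
    payload = bytes(size)
    return {
        "store": ops_per_second(lambda i: adapter.store(f"k{i}", payload), n),
        "lookup": ops_per_second(lambda i: adapter.lookup(f"k{i}"), n),
        "load": ops_per_second(lambda i: adapter.load(f"k{i}"), n),
    }


results = benchmark(InMemoryAdapter())
```

Because only the adapter is exercised, the numbers isolate the storage backend from engine and server overhead, which is the point of benchmarking it standalone.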
Read more and start testing:
https://t.co/SS7u7tzyYP
#AI #inference #LMCache #KVCache
Congrats to @tensormesh for the funding!
Tensormesh is among the major contributors to #LMCache. The investment from @CoreWeave, @nvidia and @AMD (among others) testifies to the important role #LMCache plays in AI infra today and tomorrow.
BTW, Tensormesh is hiring engineers (full-time, part-time or spare-time) to work on LMCache! Shoot an email to [email protected] if you are interested.
Today we announced $20M in new funding from investors including AMD Ventures, CoreWeave, NVentures, Valley Capital Partners, and Laude Ventures, bringing Tensormesh's total funding to $24.5M.
We're also launching Tensormesh Inference into general availability.
AI applications are moving into production, and inference costs are becoming harder to ignore.
Agentic workflows repeatedly process the same prompts, context, conversation history, and tool definitions, driving up API costs on work that has already been done.
Tensormesh changes that with caching-accelerated inference.
We're also introducing $0 cached input tokens across Tensormesh serverless deployments, so teams only pay when input tokens need to be processed, not when they can be served from cache.
Read the full announcement: https://t.co/V721yYR8tr
Calling all non-CUDA users: LMCache MP mode now reaches beyond CUDA!
On non-CUDA devices, LMCache MP can now use ZMQ (instead of CUDA IPC) to send the KV bytes.
LMCache MP mode normally uses CUDA IPC, which is not available on non-CUDA devices. To remove that limitation, community contributor ๐ก๐ฅ๐ข๐ง๐๐ added a non-CUDA transfer path for CPU, XPU, HPU, and other non-CUDA environments. Since these devices do not support CUDA IPC, the worker sends the actual KV bytes over the message queue instead:
gather paged KV -> CPU chunk -> serialize with pickle -> send bytes over ZMQ -> deserialize on the server -> write to L1
On CUDA devices, LMCache continues to use the existing CUDA IPC path, where the worker sends a lightweight handle and the server reads the worker's GPU memory directly:
worker paged KV (GPU) -> LMCache reads via CUDA IPC -> GPU staging buffer -> L1 cache (CPU RAM)
In both paths, ZMQ serves as the control channel and carries messages such as REGISTER, PREPARE_STORE, and COMMIT_STORE.
Compared with the CUDA path, the non-CUDA path adds two CPU-side copies, but extends MP mode to non-CUDA environments.
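The non-CUDA flow can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the LMCache implementation: a `queue.Queue` stands in for the ZMQ socket, plain lists stand in for KV tensors, and `worker_store`, `server_receive`, and `l1_cache` are hypothetical names.

```python
import pickle
import queue
from typing import Dict, List

# Stand-in for the ZMQ socket between worker and server (assumption:
# a real deployment would send these bytes over a ZMQ connection).
channel: "queue.Queue[bytes]" = queue.Queue()
l1_cache: Dict[str, List[float]] = {}


def worker_store(key: str, paged_kv: List[List[float]], page_ids: List[int]) -> int:
    """Gather the request's pages into one CPU chunk, serialize, and send."""
    cpu_chunk = [x for page_id in page_ids for x in paged_kv[page_id]]
    payload = pickle.dumps({"key": key, "chunk": cpu_chunk})
    channel.put(payload)
    return len(payload)


def server_receive() -> str:
    """Deserialize one message and write its chunk to L1."""
    message = pickle.loads(channel.get())  # only unpickle trusted peers
    l1_cache[message["key"]] = message["chunk"]
    return message["key"]


paged_kv = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # three pages of KV values
sent_bytes = worker_store("req-1", paged_kv, page_ids=[2, 0])
stored_key = server_receive()
```

Unlike the CUDA IPC path, where only a lightweight handle crosses the process boundary, here the full payload travels through the channel, which is where the extra CPU-side copying comes from.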
#KVCache #LMCache #AI #inference
New blog: When Open Source Meets Open Source: A Joint Effort Between LMCache and Mooncake
The story starts with the LMCache community building the foundation: the native connector framework, dynamic plugin loading, and the MooncakeStore L2 plugin path for MP mode.
The Mooncake community then helped optimize the RDMA path step by step, adding L1 memory preregistration, batch operations, and dedicated worker lanes for different cache operations. Under Mooncake RDMA, this worker-lane design reduced lookup ๐ฉ๐๐ from ๐๐.๐ ms to ๐.๐๐ ms!
This was not a one-sided integration. LMCache brought the MP framework and native connector abstraction, and Mooncake brought deep storage and RDMA expertise. Together, the two communities built a stronger L2 KV cache integration for distributed LLM inference systems.
Huge thanks to maobaolong, fangchizheng, chunxiaozheng, and everyone in both communities who helped make this happen!
Read the full story:
https://t.co/K9nqWpLjsE
#KVCache #LMCache #AI #inference
PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache.
In Prefill-Decode Disaggregation, a single LLM request is split across two types of nodes. A prefill node reads the prompt and produces the KV cache, while a decode node consumes that KV cache to generate tokens. The KV cache needs to move from the prefill node to the decode node over the network, typically through RDMA. In LMCache, the component responsible for moving these KV chunks is called the PDBackend.
Before the asynchronous PDBackend, LMCache's prefill workers sent KV cache chunks one at a time and waited for each transfer to finish before continuing. This worked for simple cases, but under chunked prefill, where a long prompt is split into multiple KV transfers, concurrent requests could deadlock.
The new fully asynchronous PDBackend moves KV transfer off the critical path. Instead of blocking on each network transfer, the prefill worker can hand off KV chunks in the background and continue processing the next prompt. On the receiver side, LMCache also reserves enough buffer space for the whole request before the transfer starts, so each admitted request has enough room to finish.
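The two ideas in this design, background hand-off on the sender and whole-request reservation on the receiver, can be sketched with asyncio. This is a toy model, not the PDBackend code: `ReceiverBuffer`, `transfer`, and `prefill_worker` are hypothetical names, and `asyncio.sleep(0)` stands in for a network send.

```python
import asyncio
from typing import List, Tuple


class ReceiverBuffer:
    """Hypothetical decode-side buffer that admits a request only if it fits whole."""

    def __init__(self, capacity_chunks: int) -> None:
        self._free = capacity_chunks
        self._cond = asyncio.Condition()
        self.received: List[str] = []

    async def reserve(self, n_chunks: int) -> None:
        # n_chunks must not exceed the total capacity, or this waits forever.
        async with self._cond:
            await self._cond.wait_for(lambda: self._free >= n_chunks)
            self._free -= n_chunks

    async def release(self, n_chunks: int) -> None:
        async with self._cond:
            self._free += n_chunks
            self._cond.notify_all()


async def transfer(buf: ReceiverBuffer, req: str, n_chunks: int) -> None:
    await buf.reserve(n_chunks)      # all-or-nothing admission
    for i in range(n_chunks):
        await asyncio.sleep(0)       # stands in for one network send
        buf.received.append(f"{req}/{i}")
    await buf.release(n_chunks)      # space is freed once the request is done


async def prefill_worker() -> Tuple[List[str], List[str]]:
    buf = ReceiverBuffer(capacity_chunks=4)
    log: List[str] = []
    tasks = []
    for req, n_chunks in [("A", 3), ("B", 3)]:
        log.append(f"prefill {req}")
        # Hand the transfer off and move on to the next prompt right away.
        tasks.append(asyncio.create_task(transfer(buf, req, n_chunks)))
    await asyncio.gather(*tasks)
    return log, buf.received


log, received = asyncio.run(prefill_worker())
```

With four free chunks and two three-chunk requests, chunk-by-chunk admission could leave each request holding two chunks and waiting for a third. Reserving the whole request up front means B simply waits until A has finished.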
This update is a great community effort from LMCache. As Prefill-Decode Disaggregation becomes more widely used, improvements like async PDBackend are essential for making KV cache transfer more reliable and scalable. Thank you to everyone in the LMCache community who helped shape, review, and harden this update!
#KVCache #LMCache #AI #inference
The OpenAI-compatible API is becoming the IPv4 of LLM systems. ⏳
At the top: agents, RAG, chatbots, tools, and workflows.
In the middle: the OpenAI-compatible API.
Below: routing, batching, scheduling, KV cache, inference engines, and hardware.
It's the familiar network-layering pattern, where IPv4 acts as the narrow waist between everything above and below it. This shared interface lets applications run across providers while inference backends optimize underneath.
But the narrow waist also creates a blind spot. Once modern LLM applications cross the API boundary, much of their structure becomes just a sequence of tokens.
That's why the next wave of LLM infrastructure may depend on what happens below the API: better scheduling, better cache reuse, and better AI-native memory systems.
Read more in Junchen Jiang's new blog, "OpenAI API Is the New IPv4":
https://t.co/lZQhGNtP8K
#AI #Inference #LMCache #KVCache #Network
LMCache ๐ฏ๐.๐.๐ is out!
This release was a massive community effort. A huge shoutout to our 34 contributors who shipped 119 commits! Whether you submitted a PR, reported an issue, or joined the discussions, thank you for making this possible!
What's New:
🔹 DeepSeek V4 support
🔹 TensorRT-LLM integration
🔹 Hardware support for AMD ROCm & Intel HPU
🔹 SERDE support for pluggable KV cache transformations
⚠️ Heads-up: Our default CUDA wheel is now cu13 (cu12.9 is still available).
Read the full release notes: https://t.co/gBMTMSldzx
Thank you for being such an incredible community. We can't wait to see what we build together next!
#AI #LLM #Inference #LMCache
Cache misses happen for three main reasons:
1. not enough KV cache capacity
2. context changes: tool calls, system prompt, messages
3. model changes
For your own deployment, you can avoid all three (without changing the way you interact with your agent) via:
1. tuning your KV cache store size: https://t.co/Gz1PiHoRLc
2. CacheBlend for non-prefix reuse: https://t.co/WaRjs34N94
3. DroidSpeak for cross-LLM reuse: https://t.co/zmnZggEZR1
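Reason 2 is easier to see with a toy model of prefix-based reuse. In the sketch below, each chunk's key is a hash chained with the key before it. This is a common scheme for prefix caching and is used here as an assumption, not as a description of LMCache internals. `chunk_keys` and `reusable_chunks` are hypothetical helpers.

```python
import hashlib
from typing import List, Set


def chunk_keys(tokens: List[int], chunk_size: int = 4) -> List[str]:
    """Key each full chunk by a hash of its tokens chained with the previous key."""
    keys: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        text = prev + "|" + ",".join(map(str, chunk))
        prev = hashlib.sha256(text.encode()).hexdigest()
        keys.append(prev)
    return keys


def reusable_chunks(tokens: List[int], cached: Set[str], chunk_size: int = 4) -> int:
    """Count leading chunks whose keys are already in the cache."""
    hits = 0
    for key in chunk_keys(tokens, chunk_size):
        if key not in cached:
            break
        hits += 1
    return hits


first_turn = list(range(12))                  # 3 chunks of 4 tokens
cached = set(chunk_keys(first_turn))

appended = first_turn + [100, 101, 102, 103]  # new message added at the end
edited = [999] + first_turn[1:]               # one token changed at the start

hits_appended = reusable_chunks(appended, cached)
hits_edited = reusable_chunks(edited, cached)
```

Appending to the context keeps every earlier chunk reusable, while a single change near the start invalidates everything after it under prefix-only matching. That gap is what non-prefix reuse is meant to close.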
Prompt cache diagnostics are now in Claude Console.
When a request misses the cache, you can now see exactly which part of your prompt changed and how many tokens it cost you.
LMCache now supports TensorRT-LLM, alongside vLLM and SGLang!
With this integration, TensorRT-LLM can use LMCache for KV cache lookup, retrieve, and store during the request lifecycle. In our recommended multiprocess (MP) mode, the engines talk to a standalone LMCache server, enabling shared KV cache management across multiple TRT-LLM workers on the same node.
The main engineering difference is TensorRT-LLM's KV memory layout. Unlike vLLM and SGLang, which commonly expose KV cache in a layer-oriented layout, TensorRT-LLM packs multiple layers within shared KV cache blocks for efficient GPU access. LMCache now understands this packed layout and can efficiently read and write TensorRT-LLM KV cache.
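As a rough picture of the layout difference, the sketch below converts between a layer-oriented view and a block-oriented view using nested lists. The shapes and the names `layered_to_packed` and `packed_to_layered` are illustrative assumptions only; the real TensorRT-LLM block layout has more dimensions and is not reproduced here.

```python
from typing import List

Layered = List[List[List[float]]]  # indexed as [layer][block][values]
Packed = List[List[List[float]]]   # indexed as [block][layer][values]


def layered_to_packed(layered: Layered) -> Packed:
    """Regroup per-layer block lists so each block holds all of its layers."""
    num_layers = len(layered)
    num_blocks = len(layered[0])
    return [
        [layered[layer][block] for layer in range(num_layers)]
        for block in range(num_blocks)
    ]


def packed_to_layered(packed: Packed) -> Layered:
    """Inverse regrouping: one block list per layer."""
    num_blocks = len(packed)
    num_layers = len(packed[0])
    return [
        [packed[block][layer] for block in range(num_blocks)]
        for layer in range(num_layers)
    ]


layered = [
    [[0.0, 0.1], [0.2, 0.3]],  # layer 0: block 0, block 1
    [[1.0, 1.1], [1.2, 1.3]],  # layer 1: block 0, block 1
]
packed = layered_to_packed(layered)
round_trip = packed_to_layered(packed)
```

The data is identical in both views; only the indexing order changes. A cache layer that reads or writes KV blocks therefore has to know which order the engine uses.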
This brings LMCache's KV reuse and multi-tier cache capabilities to TensorRT-LLM, connecting it to LMCache's broad ecosystem.
Start here:
https://t.co/wTNcmpqhYm
Explore validated recipes for models deployed with TensorRT-LLM and help us expand coverage. It's a great starting PR for new contributors:
https://t.co/YWTMub9V6R
#AI #LLM #Inference #LMCache #NVIDIA #vLLM #SGLang