LMCache Lab

@lmcache

🧪 Open-Source Team that maintains LMCache and Production Stack 🤖 Democratizing AI by providing efficient LLM serving for ALL

Github, Online

Joined September 2024

50 Following

962 Followers

207 Posts

LMCache Lab

@lmcache

about 5 hours ago

🔧 𝐍𝐞𝐰 𝐢𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐒𝐞𝐫𝐯𝐞𝐫: /𝐫𝐮𝐧_𝐬𝐜𝐫𝐢𝐩𝐭 𝐀𝐝𝐦𝐢𝐧 𝐄𝐧𝐝𝐩𝐨𝐢𝐧𝐭 One of our core maintainers, 𝐦𝐚𝐨𝐛𝐚𝐨𝐥𝐨𝐧𝐠, introduced /run_script, a new admin endpoint for live debugging and tuning in the LMCache MP server. With /run_script, developers can inspect runtime state, adjust read/write TTLs, query L1 memory usage, and check server status — all without restarting or redeploying the server. Because it can access attributes through app.state.engine, changes such as TTL updates are re-read by the running code and take effect on the next read/write operation. 📖 Read the full beginner-friendly tutorial and implementation details here:https://t.co/z8siE04XQP

lmcache's tweet photo. 🔧 𝐍𝐞𝐰 𝐢𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐒𝐞𝐫𝐯𝐞𝐫: /𝐫𝐮𝐧_𝐬𝐜𝐫𝐢𝐩𝐭 𝐀𝐝𝐦𝐢𝐧 𝐄𝐧𝐝𝐩𝐨𝐢𝐧𝐭

One of our core maintainers, 𝐦𝐚𝐨𝐛𝐚𝐨𝐥𝐨𝐧𝐠, introduced /run_script, a new admin endpoint for live debugging and tuning in the LMCache MP server.

With /run_script, developers can inspect runtime state, adjust read/write TTLs, query L1 memory usage, and check server status — all without restarting or redeploying the server.

Because it can access attributes through app.state.engine, changes such as TTL updates are re-read by the running code and take effect on the next read/write operation.

📖 Read the full beginner-friendly tutorial and implementation details here:https://t.co/z8siE04XQP

LMCache Lab

@lmcache

about 7 hours ago

Update our Slack invitation link: https://t.co/AoXYVWoteJ This one should never expire 🫡

LMCache Lab

@lmcache

1 day ago

KV cache is becoming an independent AI-native data layer — shared across requests, clusters, and serving systems. LMCache is proud to help push this frontier forward as an open-source community. As this space continues to evolve and gain momentum, a new chapter begins for LMCache and the broader KV cache community. Read more: https://t.co/8wlClx0esv 𝐒𝐭𝐚𝐲 𝐜𝐨𝐧𝐧𝐞𝐜𝐭𝐞𝐝 𝐰𝐢𝐭𝐡 𝐋𝐌𝐂𝐚𝐜𝐡𝐞: • Follow us on LinkedIn: https://t.co/HDqzkXL2CR • Join our Slack community: https://t.co/gt1SrXsnV7 • Follow our WeChat Official Account: https://t.co/zSesWi5ulv #AI #inference #LMCache #KVCache

160

LMCache Lab

@lmcache

2 days ago

𝐃𝐲𝐧𝐚𝐦𝐨 𝐰𝐢𝐭𝐡 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐦𝐨𝐝𝐞 We've updated the Dynamo integration to support LMCache's new multiprocess(MP) mode, complete with ready-to-run startup scripts. If you're serving with Dynamo, there's now a launch path for running LMCache as an out-of-process sidecar alongside the vLLM backend. Dynamo connects to the sidecar through LMCacheMPConnector, bringing the integration in line with LMCache's newer multiprocess architecture. Huge thanks to @shaoting_feng for making this possible! Up next: disaggregated serving support for MP mode in Dynamo. Stay tuned! 🚀 👉 Explore more: https://t.co/9eoridwMSU #AI #inference #LMCache #KVCache

lmcache's tweet photo. 𝐃𝐲𝐧𝐚𝐦𝐨 𝐰𝐢𝐭𝐡 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐦𝐨𝐝𝐞

We've updated the Dynamo integration to support LMCache's new multiprocess(MP) mode, complete with ready-to-run startup scripts. If you're serving with Dynamo, there's now a launch path for running LMCache as an out-of-process sidecar alongside the vLLM backend. Dynamo connects to the sidecar through LMCacheMPConnector, bringing the integration in line with LMCache's newer multiprocess architecture.

Huge thanks to @shaoting_feng for making this possible! Up next: disaggregated serving support for MP mode in Dynamo. Stay tuned! 🚀

👉 Explore more: https://t.co/9eoridwMSU

#AI #inference #LMCache #KVCache

175

LMCache Lab

@lmcache

6 days ago

𝐍𝐞𝐰 𝐢𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞: 𝐋𝟐 𝐚𝐝𝐚𝐩𝐭𝐞𝐫 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐂𝐋𝐈. You can now benchmark throughput of an L2 cache adapter directly without starting an inference engine or an LMCache MP server for all of its base operations (store / lookup / load). The command only requires the adapter’s backing storage to be reachable, making it easier to test and compare L2 backends before plugging them into a full serving workflow. Try it with the L2 backend that best fits your workflow, whether that’s local filesystem, Redis, S3, or any other adapter. Read more and start testing: https://t.co/SS7u7tzyYP #AI #inference #LMCache #KVCache

lmcache's tweet photo. 𝐍𝐞𝐰 𝐢𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞: 𝐋𝟐 𝐚𝐝𝐚𝐩𝐭𝐞𝐫 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐂𝐋𝐈.

You can now benchmark throughput of an L2 cache adapter directly without starting an inference engine or an LMCache MP server for all of its base operations (store / lookup / load).

The command only requires the adapter’s backing storage to be reachable, making it easier to test and compare L2 backends before plugging them into a full serving workflow.

Try it with the L2 backend that best fits your workflow, whether that’s local filesystem, Redis, S3, or any other adapter.

Read more and start testing:
https://t.co/SS7u7tzyYP

#AI #inference #LMCache #KVCache

149

LMCache Lab

@lmcache

7 days ago

Congrats to @tensormesh for the funding! Tensormesh is among the major contributors to #LMCache. The investment from @CoreWeave , @nvidia and @AMD (among others) testifies to the important role #LMCache plays in AI infra today and tomorrow. BTW, Tensormesh is hiring engineers (full-time, part-time or spare-time) to work on LMCache! Shoot an email to [email protected] if you are interested.

Tensormesh

@tensormesh

7 days ago

Today we announced $20M in new funding from investors including AMD Ventures, CoreWeave, NVentures, Valley Capital Partners, and Laude Ventures, bringing Tensormesh’s total funding to $24.5M. We’re also launching Tensormesh Inference into general availability. AI applications are moving into production, and inference costs are becoming harder to ignore. Agentic workflows repeatedly process the same prompts, context, conversation history, and tool definitions, driving up API costs on work that has already been done. Tensormesh changes that with caching-accelerated inference. We’re also introducing $0 cached input tokens across Tensormesh serverless deployments, so teams only pay when input tokens need to be processed, not when they can be served from cache. Read the full announcement: https://t.co/V721yYR8tr

tensormesh's tweet photo. Today we announced $20M in new funding from investors including AMD Ventures, CoreWeave, NVentures, Valley Capital Partners, and Laude Ventures, bringing Tensormesh’s total funding to $24.5M.

We’re also launching Tensormesh Inference into general availability.

AI applications are moving into production, and inference costs are becoming harder to ignore.

Agentic workflows repeatedly process the same prompts, context, conversation history, and tool definitions, driving up API costs on work that has already been done.

Tensormesh changes that with caching-accelerated inference.

We’re also introducing $0 cached input tokens across Tensormesh serverless deployments, so teams only pay when input tokens need to be processed, not when they can be served from cache.

Read the full announcement: https://t.co/V721yYR8tr

372

LMCache Lab

@lmcache

7 days ago

𝐂𝐚𝐥𝐥𝐢𝐧𝐠 𝐚𝐥𝐥 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 𝐮𝐬𝐞𝐫𝐬 — 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐧𝐨𝐰 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐛𝐞𝐲𝐨𝐧𝐝 𝐂𝐔𝐃𝐀! On non-CUDA devices, LMCache MP can now use ZMQ (instead of CUDA IPC) to send the KV bytes. LMCache MP mode uses CUDA IPC, but this is not available on non-CUDA devices. To remove that limitation, community contributor 𝐡𝐥𝐢𝐧𝟗𝟗 added a 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 transfer path for CPU, XPU, HPU, and other non-CUDA environments. Since these devices do not support CUDA IPC, the worker sends the actual KV bytes over the message queue instead: 𝑔𝑎𝑡ℎ𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 -> 𝐶𝑃𝑈 𝑐ℎ𝑢𝑛𝑘𝑠 -> 𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑤𝑖𝑡ℎ 𝑝𝑖𝑐𝑘𝑙𝑒 -> 𝑠𝑒𝑛𝑑 𝑏𝑦𝑡𝑒𝑠 𝑜𝑣𝑒𝑟 𝑍𝑀𝑄 -> 𝑑𝑒𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑜𝑛 𝑡ℎ𝑒 𝑠𝑒𝑟𝑣𝑒𝑟 -> 𝑤𝑟𝑖𝑡𝑒 𝑡𝑜 𝐿1 On CUDA devices, LMCache continues to use the existing CUDA IPC path, where the worker sends a lightweight handle and the server reads the worker’s GPU memory directly: 𝑤𝑜𝑟𝑘𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 (𝐺𝑃𝑈) -> 𝐿𝑀𝐶𝑎𝑐ℎ𝑒 𝑟𝑒𝑎𝑑𝑠 𝑣𝑖𝑎 𝐶𝑈𝐷𝐴 𝐼𝑃𝐶 -> 𝐺𝑃𝑈 𝑠𝑡𝑎𝑔𝑖𝑛𝑔 𝑏𝑢𝑓𝑓𝑒𝑟 -> 𝐿1 𝑐𝑎𝑐ℎ𝑒 (𝐶𝑃𝑈 𝑅𝐴𝑀) In both paths, ZMQ serves as the control channel and carries messages such as REGISTER, PREPARE_STORE, and COMMIT_STORE. Compared with the CUDA path, the non-CUDA path adds two CPU-side copies, but 𝐞𝐱𝐭𝐞𝐧𝐝𝐬 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐭𝐨 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 environments. #KVCache #LMCache #AI #inference

lmcache's tweet photo. 𝐂𝐚𝐥𝐥𝐢𝐧𝐠 𝐚𝐥𝐥 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 𝐮𝐬𝐞𝐫𝐬 — 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐧𝐨𝐰 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐛𝐞𝐲𝐨𝐧𝐝 𝐂𝐔𝐃𝐀!

On non-CUDA devices, LMCache MP can now use ZMQ (instead of CUDA IPC) to send the KV bytes.

LMCache MP mode uses CUDA IPC, but this is not available on non-CUDA devices. To remove that limitation, community contributor 𝐡𝐥𝐢𝐧𝟗𝟗 added a 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 transfer path for CPU, XPU, HPU, and other non-CUDA environments. Since these devices do not support CUDA IPC, the worker sends the actual KV bytes over the message queue instead:

𝑔𝑎𝑡ℎ𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 -> 𝐶𝑃𝑈 𝑐ℎ𝑢𝑛𝑘𝑠 -> 𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑤𝑖𝑡ℎ 𝑝𝑖𝑐𝑘𝑙𝑒 -> 𝑠𝑒𝑛𝑑 𝑏𝑦𝑡𝑒𝑠 𝑜𝑣𝑒𝑟 𝑍𝑀𝑄 -> 𝑑𝑒𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑜𝑛 𝑡ℎ𝑒 𝑠𝑒𝑟𝑣𝑒𝑟 -> 𝑤𝑟𝑖𝑡𝑒 𝑡𝑜 𝐿1

On CUDA devices, LMCache continues to use the existing CUDA IPC path, where the worker sends a lightweight handle and the server reads the worker’s GPU memory directly:

𝑤𝑜𝑟𝑘𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 (𝐺𝑃𝑈) -> 𝐿𝑀𝐶𝑎𝑐ℎ𝑒 𝑟𝑒𝑎𝑑𝑠 𝑣𝑖𝑎 𝐶𝑈𝐷𝐴 𝐼𝑃𝐶 -> 𝐺𝑃𝑈 𝑠𝑡𝑎𝑔𝑖𝑛𝑔 𝑏𝑢𝑓𝑓𝑒𝑟 -> 𝐿1 𝑐𝑎𝑐ℎ𝑒 (𝐶𝑃𝑈 𝑅𝐴𝑀)

In both paths, ZMQ serves as the control channel and carries messages such as REGISTER, PREPARE_STORE, and COMMIT_STORE.

Compared with the CUDA path, the non-CUDA path adds two CPU-side copies, but 𝐞𝐱𝐭𝐞𝐧𝐝𝐬 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐭𝐨 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 environments.

#KVCache #LMCache #AI #inference

171

LMCache Lab

@lmcache

8 days ago

You can also enjoy the Chinese version here: https://t.co/OcEqzI38Hu

140

LMCache Lab

@lmcache

8 days ago

New blog: 𝐖𝐡𝐞𝐧 𝐎𝐩𝐞𝐧 𝐒𝐨𝐮𝐫𝐜𝐞 𝐌𝐞𝐞𝐭𝐬 𝐎𝐩𝐞𝐧 𝐒𝐨𝐮𝐫𝐜𝐞 — 𝐀 𝐉𝐨𝐢𝐧𝐭 𝐄𝐟𝐟𝐨𝐫𝐭 𝐁𝐞𝐭𝐰𝐞𝐞𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐚𝐧𝐝 𝐌𝐨𝐨𝐧𝐜𝐚𝐤𝐞 The story starts with the LMCache community building the foundation: the native connector framework, dynamic plugin loading, and the MooncakeStore L2 plugin path for MP mode. The Mooncake community then helped optimize the RDMA path step by step, adding L1 memory preregistration, batch operations, and dedicated worker lanes for different cache operations. Under Mooncake RDMA, 𝐭𝐡𝐢𝐬 𝐰𝐨𝐫𝐤𝐞𝐫-𝐥𝐚𝐧𝐞 𝐝𝐞𝐬𝐢𝐠𝐧 𝐫𝐞𝐝𝐮𝐜𝐞𝐝 𝐥𝐨𝐨𝐤𝐮𝐩 𝐩𝟗𝟗 𝐟𝐫𝐨𝐦 𝟏𝟔.𝟖 𝐦𝐬 𝐭𝐨 𝟎.𝟒𝟖 𝐦𝐬! This was not a one-sided integration. LMCache brought the MP framework and native connector abstraction and Mooncake brought deep storage and RDMA expertise. Together, the two communities built a stronger L2 KV cache integration for distributed LLM inference systems. Huge thanks to maobaolong, fangchizheng, chunxiaozheng, and everyone in both communities who helped make this happen! Read the full story: https://t.co/K9nqWpLjsE #KVCache #LMCache #AI #inference

252

LMCache Lab

@lmcache

13 days ago

PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. In Prefill-Decode Disaggregation, a single LLM request is split across two types of nodes. A prefill node reads the prompt and produces the KV cache, while a decode node consumes that KV cache to generate tokens. The KV cache needs to move from the prefill node to the decode node over the network, typically through RDMA. In LMCache, the component responsible for moving these KV chunks is called the PDBackend. Before the asynchronous PDBackend, LMCache’s prefill workers sent KV cache chunks one at a time and waited for each transfer to finish before continuing. This worked for simple cases, but under chunked prefill, where a long prompt is split into multiple KV transfers, concurrent requests could deadlock. The new fully asynchronous PDBackend moves KV transfer off the critical path. Instead of blocking on each network transfer, the prefill worker can hand off KV chunks in the background and continue processing the next prompt. On the receiver side, LMCache also reserves enough buffer space for the whole request before the transfer starts, so each admitted request has enough room to finish. This update is a great community effort from LMCache. As Prefill-Decode Disaggregation becomes more widely used, improvements like async PDBackend are essential for making KV cache transfer more reliable and scalable. Thank you to everyone in the LMCache community who helped shape, review, and harden this update! #KVCache #LMCache #AI #inference

lmcache's tweet photo. PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache.

In Prefill-Decode Disaggregation, a single LLM request is split across two types of nodes. A prefill node reads the prompt and produces the KV cache, while a decode node consumes that KV cache to generate tokens. The KV cache needs to move from the prefill node to the decode node over the network, typically through RDMA. In LMCache, the component responsible for moving these KV chunks is called the PDBackend.

Before the asynchronous PDBackend, LMCache’s prefill workers sent KV cache chunks one at a time and waited for each transfer to finish before continuing. This worked for simple cases, but under chunked prefill, where a long prompt is split into multiple KV transfers, concurrent requests could deadlock.

The new fully asynchronous PDBackend moves KV transfer off the critical path. Instead of blocking on each network transfer, the prefill worker can hand off KV chunks in the background and continue processing the next prompt. On the receiver side, LMCache also reserves enough buffer space for the whole request before the transfer starts, so each admitted request has enough room to finish.

This update is a great community effort from LMCache. As Prefill-Decode Disaggregation becomes more widely used, improvements like async PDBackend are essential for making KV cache transfer more reliable and scalable. Thank you to everyone in the LMCache community who helped shape, review, and harden this update!

#KVCache #LMCache #AI #inference

318

LMCache Lab

@lmcache

14 days ago

𝐓𝐡𝐞 𝐎𝐩𝐞𝐧𝐀𝐈-𝐜𝐨𝐦𝐩𝐚𝐭𝐢𝐛𝐥𝐞 𝐀𝐏𝐈 𝐢𝐬 𝐛𝐞𝐜𝐨𝐦𝐢𝐧𝐠 𝐭𝐡𝐞 𝐈𝐏𝐯𝟒 𝐨𝐟 𝐋𝐋𝐌 𝐬𝐲𝐬𝐭𝐞𝐦𝐬. ⏳ At the top: agents, RAG, chatbots, tools, and workflows. In the middle: the OpenAI-compatible API. Below: routing, batching, scheduling, KV cache, inference engines, and hardware. It's the familiar network-layering pattern, where IPv4 acts as the narrow waist between everything above and below it. This shared interface lets applications run across providers while inference backends optimize underneath. But the narrow waist also creates a blind spot. Once modern LLM applications cross the API boundary, much of their structure becomes just a sequence of tokens. That's why the next wave of LLM infrastructure may depend on what happens below the API: better scheduling, better cache reuse, and better AI-native memory systems. 📖 Read more in Junchen Jiang's new blog, "𝐎𝐩𝐞𝐧𝐀𝐈 𝐀𝐏𝐈 𝐈𝐬 𝐭𝐡𝐞 𝐍𝐞𝐰 𝐈𝐏𝐯𝟒": https://t.co/lZQhGNtP8K #AI #Inference #LMCache #KVCache #Network

228

LMCache Lab

@lmcache

15 days ago

𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐯𝟎.𝟒.𝟓 𝐢𝐬 𝐨𝐮𝐭! 🎉 This release was a massive community effort. A huge shoutout to our 34 contributors who shipped 119 commits! Whether you submitted a PR, reported an issue, or joined the discussions—thank you for making this possible! 🎉 𝐖𝐡𝐚𝐭'𝐬 𝐍𝐞𝐰: 🔹 DeepSeek V4 support 🔹 TensorRT-LLM integration 🔹 Hardware support for AMD ROCm & Intel HPU 🔹 SERDE support for pluggable KV cache transformations ⚠️ 𝐇𝐞𝐚𝐝𝐬-𝐮𝐩: Our default CUDA wheel is now cu13 (cu12.9 is still available). 📖 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐟𝐮𝐥𝐥 𝐫𝐞𝐥𝐞𝐚𝐬𝐞 𝐧𝐨𝐭𝐞𝐬: https://t.co/gBMTMSldzx Thank you for being such an incredible community. We can't wait to see what we build together next! #AI #LLM #Inference #LMCache

lmcache's tweet photo. 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐯𝟎.𝟒.𝟓 𝐢𝐬 𝐨𝐮𝐭! 🎉
This release was a massive community effort. A huge shoutout to our 34 contributors who shipped 119 commits! Whether you submitted a PR, reported an issue, or joined the discussions—thank you for making this possible!

🎉 𝐖𝐡𝐚𝐭'𝐬 𝐍𝐞𝐰:
🔹 DeepSeek V4 support
🔹 TensorRT-LLM integration
🔹 Hardware support for AMD ROCm & Intel HPU
🔹 SERDE support for pluggable KV cache transformations

⚠️ 𝐇𝐞𝐚𝐝𝐬-𝐮𝐩: Our default CUDA wheel is now cu13 (cu12.9 is still available).

📖 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐟𝐮𝐥𝐥 𝐫𝐞𝐥𝐞𝐚𝐬𝐞 𝐧𝐨𝐭𝐞𝐬: https://t.co/gBMTMSldzx

Thank you for being such an incredible community. We can't wait to see what we build together next!

#AI #LLM #Inference #LMCache

230

LMCache Lab

@lmcache

16 days ago

Cache misses happen for a couple of reasons: 1. not enough KV Cache capacity 2. context changes: tool calls, system prompt, messages 3. model changes. For your own deployment, you can avoid all 3 (without changing the way you interact with your agent) via: 1. tuning your KV Cache Store size: https://t.co/Gz1PiHoRLc 2. CachBlend for non prefix reuse: https://t.co/WaRjs34N94 3. DroidSpeak for cross-LLM resuse: https://t.co/zmnZggEZR1

342

16 days ago

16 days ago

Prompt cache diagnostics are now in Claude Console. When a request misses the cache, you can now see exactly which part of your prompt changed and how many tokens it cost you.

ClaudeDevs's tweet photo. Prompt cache diagnostics are now in Claude Console.

When a request misses the cache, you can now see exactly which part of your prompt changed and how many tokens it cost you. https://t.co/z0dV6zzLPm

134

885

184K

409

LMCache Lab

@lmcache

16 days ago

@alexabelonix More development is on the way, stay tuned for what LMCache cooks up next 🫡🩷

LMCache Lab

@lmcache

16 days ago

𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐧𝐨𝐰 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐓𝐞𝐧𝐬𝐨𝐫𝐑𝐓-𝐋𝐋𝐌, alongside vLLM and SGLang! 🎉 With this integration, TensorRT-LLM can use LMCache for KV cache lookup, retrieve, and store during the request lifecycle. In our recommended 𝐦𝐮𝐥𝐭𝐢𝐩𝐫𝐨𝐜𝐞𝐬𝐬 (𝐌𝐏) 𝐦𝐨𝐝𝐞, the engines talk to a standalone LMCache server, enabling shared KV cache management across multiple TRT-LLM workers on the same node. The main engineering difference is TensorRT-LLM’s KV memory layout. Unlike vLLM and SGLang, which commonly expose KV cache in a layer-oriented layout, TensorRT-LLM packs multiple layers within shared KV cache blocks for efficient GPU access. LMCache now understands this packed layout and can efficiently read and write TensorRT-LLM KV cache. This brings LMCache’s KV reuse and multi-tier cache capabilities to TensorRT-LLM, connecting it to LMCache’s broad ecosystem. Start here: https://t.co/wTNcmpqhYm Explore validated recipes for models deployed with TensorRT-LLM and help us expand coverage — a great starting PR for new contributors: https://t.co/YWTMub9V6R #AI #LLM #Inference #LMCache #NVIDIA #vLLM #SGLang

lmcache's tweet photo. 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐧𝐨𝐰 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐓𝐞𝐧𝐬𝐨𝐫𝐑𝐓-𝐋𝐋𝐌, alongside vLLM and SGLang! 🎉

With this integration, TensorRT-LLM can use LMCache for KV cache lookup, retrieve, and store during the request lifecycle. In our recommended 𝐦𝐮𝐥𝐭𝐢𝐩𝐫𝐨𝐜𝐞𝐬𝐬 (𝐌𝐏) 𝐦𝐨𝐝𝐞, the engines talk to a standalone LMCache server, enabling shared KV cache management across multiple TRT-LLM workers on the same node.

The main engineering difference is TensorRT-LLM’s KV memory layout. Unlike vLLM and SGLang, which commonly expose KV cache in a layer-oriented layout, TensorRT-LLM packs multiple layers within shared KV cache blocks for efficient GPU access. LMCache now understands this packed layout and can efficiently read and write TensorRT-LLM KV cache.

This brings LMCache’s KV reuse and multi-tier cache capabilities to TensorRT-LLM, connecting it to LMCache’s broad ecosystem.

Start here:
https://t.co/wTNcmpqhYm

Explore validated recipes for models deployed with TensorRT-LLM and help us expand coverage — a great starting PR for new contributors:
https://t.co/YWTMub9V6R

#AI #LLM #Inference #LMCache #NVIDIA #vLLM #SGLang

575

LMCache Lab

@lmcache

Last Seen Users on Sotwe

Trends for you

Most Popular Users