llm-d published a new post on KServe + llm-d + vLLM for production LLM inference on Kubernetes.
Authors from @RedHat and Tesla describe how the stack addressed routing, customization, and day-2 operational challenges, citing 3x higher output tokens/s and 2x lower TTFT in one deployment after enabling prefix-cache aware routing.
By Yuan Tang, Scott Cabrinha, Robert Shaw, and Sai Krishna
@CloudNativeFdn
@_llm_d_ https://t.co/hBjaZPJ3Pb
#vLLM #KServe #Kubernetes #LLMOps #OpenSource
ReasoningBank, a novel agent memory framework, enables LLM agents to continuously learn from both successful & failed experiences. Our evaluation shows that it enhances agent effectiveness, boosting success rates and efficiency. Learn more: https://t.co/lHlYzeKMcm
A company with 60+ accounts just had its entire AI infrastructure taken offline by their provider.
No reason was given. All that was provided was an appeal path in the form of a Google Form.
This is not a one-off. We have mapped the pattern across every major closed-weight provider, along with what enterprise teams can do about it.
Read the full blog: https://t.co/NHjezy9ZpY
Try Tensormesh with $100 in free GPU Credits: https://t.co/szVTe4pk5k
My time being spent:
before using claude code --> write code
after using claude code --> read code, understand and find potential issues
My mental effort is not getting much lighter lol.
Turns out we can get SOTA on agentic benchmarks with a simple test-time method!
Excited to introduce LLM-as-a-Verifier.
Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score
You can get a verification score in a single sampling pass per candidate pair.
Blog: https://t.co/jYPZUgncLe
Code: https://t.co/caBpzd3Xkx
Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
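The two steps in this post can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `expected_rank_score`, the made-up log-prob values, the renormalization over the returned rank tokens, and the convention that a higher rank is better are all assumptions. It also scores candidates one at a time, while the post describes one sampling pass per candidate pair; how the log-probs are obtained from the model is not shown.

```python
import math

def expected_rank_score(rank_logprobs):
    """Expected rank under the model's distribution over rank tokens.

    rank_logprobs maps each rank (1..k) to the log-prob the model assigned
    to that rank token. Values are renormalized over the ranks provided,
    since the returned log-probs rarely sum to exactly 1 (assumption).
    """
    peak = max(rank_logprobs.values())  # subtract the max for numerical stability
    weights = {rank: math.exp(lp - peak) for rank, lp in rank_logprobs.items()}
    total = sum(weights.values())
    return sum(rank * w for rank, w in weights.items()) / total

# Made-up log-probs for two hypothetical candidates on a 1-3 scale.
candidates = {
    "candidate_a": {1: math.log(0.1), 2: math.log(0.2), 3: math.log(0.7)},
    "candidate_b": {1: math.log(0.5), 2: math.log(0.3), 3: math.log(0.2)},
}
scores = {name: expected_rank_score(lp) for name, lp in candidates.items()}

# Assumes a higher rank means a better result.
best = max(scores, key=scores.get)
```

Compared with taking only the single most likely rank token, the expected value keeps the model's uncertainty, so two candidates that both get rank 3 as the top token can still be separated.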
The latest models use efficient alternatives to full attention, such as Mamba or sliding-window attention. This opens up huge potential in the KV cache offloading layer, and LMCache needs to catch up.
GPU memory alone won't carry the next generation of LLM serving.
At #RaySummit, our Chief Scientist @this_will_echo shared how #LMCache offloads KV Cache across CPU RAM, local disk, Redis, and S3, while enabling cache reuse beyond basic prefix caching.
Watch the full talk on YouTube:
https://t.co/89qjddXbT1
#RaySummit #LMCache #Tensormesh #KVCache
we need agent evals that are really consistent with real-world usage. otherwise people are optimizing foundation models in the wrong direction. the problem of targeting is even bigger than benchmaxxing.
Two years ago, we had just 2 NVIDIA A40s.
Two years later, our project is mentioned in Jensen Huang's GTC talk.
Hope is the first-order weapon for humans to fight for the future.
Some former colleagues from @lmcache shared this photo from the GTC Keynote. I am honestly surprised how fast the team has been growing. (We were a research lab on 2 A40 GPUs in 2023!)
btw I think they are hiring LLM hackers (or product hackers, I am not sure 🤪, you should just check with @JunchenJiang @ChengYihuaA)
#GTC #LLM #Inference #Nvidia #LMCache #KVCache
Why not store KV cache permanently?
In case you missed it, #IBM recently posted two blogs on llm-d + LMCache-based KV storage.
Avoiding recomputation is the goal, but it's still rare to see KV cache treated as shared, persistent infrastructure in real production deployments.
Excited to see LMCache be part of this with IBM, a long-time collaborator of the LMCache community. Thrilled to keep building together.
These two posts are a great look at what that can actually look like in practice:
1. Rethinking LLM Inference Economics with llm-d, LMCache, and IBM Storage Scale
https://t.co/saHl7y9ujI
2. Deploying Distributed LLM Inference Service with IBM Storage Scale for KV Cache Offloading
https://t.co/UNl4MmvAYB
Great read for anyone interested in fast yet cheap LLM inference.
#LMCache #vLLM #Kubernetes #K8s #KVCache
"Inference context is the new bottleneck" - Kevin Deierling, SVP Networking #NVIDIA
In his #GTC talk last week, he highlighted that CMX and CacheBlend from LMCache (@tensormesh) are part of the new KV cache memory stack for agents, and recognized @tensormesh among the CMX storage partners.
As the stack evolves, @tensormesh keeps building for what's next.
▶️ Session replay:
https://t.co/1UL4OspKsG
🔴 Live from #GTC2026
On the floor with our Chief Scientist @this_will_echo and CTO #Yihua Chang. #KVCache is the hottest topic of the day. Even Jensen opened with it.
They covered topics like:
#CacheBlend, @lmcache 0.4.0, and the super cool collab with @nvidia around a bot called #reachy that uses LMCache under the hood for a 20x speedup
#GTC2026 #KVCache #LMCache #TensorMesh
Happy 2026
First post of the year: a technical benchmark.
In a joint study with @tensormesh, we achieved:
- 4× TTFT improvement
- Prefix cache hit rate >50%
Using SSD-augmented KVCache on realistic multi-turn LLM traffic.
Full write-up on GMI Cloud: https://t.co/NALnwU01ke
LMCache has officially been out for 1.5 years now!
In that time, LMCache has become the default KV-cache library for open-source LLM inference (CPU offload, P2P sharing, multi-backend storage, vLLM/SGLang integration, and more).
As a PyTorch Foundation Ecosystem project, LMCache is now used by enterprise leaders across the industry (GKE, AWS, Nvidia's Dynamo, llm-d…).
What's the secret to our product?
Come see for yourself: https://t.co/oE3SfgXpWC
♥️ A huge thank you to our contributors and community: you've shaped what LMCache is today. (@lmcache)
#KVCache #LMCache #LLM #vLLM
GitHub is not acting normal...
Our LMCache logo suddenly disappeared today, even though we didn't make any changes. And we can't even clone the repo over SSH.
GitHub bad bad.