Kuntai Du @this_will_echo - Twitter Profile

Kuntai Du @this_will_echo

about 1 month ago

Sooo smol

Cat 🐈

@CuteCatsMagic

about 1 month ago


Kuntai Du @this_will_echo

about 1 month ago

Goblin is lowkey invading open-source projects lol.

LMCache Lab

@lmcache

about 1 month ago

GOBLIN MODE: ON , , /(.-" "-.)\ |\ \/ \/ /| | \ / =. .= \ / | \( \ o\/o / ) / \_, '-/ \-' ,_/ / \__/ \ \ \__/\__/ / ___\ \| -- |/ /___ /` \ / `\ / '----' \ [!] Goblin breach detected // LMCache docs > Click to make them go away ▓▒░ https://t.co/0MnbAESrYS


Kuntai Du @this_will_echo

about 1 month ago

massage and then cut, cruel

Liora Madeleine Grace

@radeleince

about 1 month ago

Cutting a sweet mango with machine


Kuntai Du @this_will_echo

about 1 month ago

So glad to see our project --- LMCache --- is included in the discussion!

PyTorch

@PyTorch

about 1 month ago

llm-d published a new post on KServe + llm-d + vLLM for production LLM inference on Kubernetes. Authors from @RedHat and Tesla describe how the stack addressed routing, customization, and day-2 operational challenges, citing 3x higher output tokens/s and 2x lower TTFT in one deployment after enabling prefix-cache aware routing. By Yuan Tang, Scott Cabrinha, Robert Shaw, and Sai Krishna @CloudNativeFdn 🔗 @_llm_d_ https://t.co/hBjaZPJ3Pb #vLLM #KServe #Kubernetes #LLMOps #OpenSource
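The "prefix-cache aware routing" mentioned in this post can be illustrated with a minimal sketch. This is not llm-d's or KServe's implementation: the block size, the chained SHA-256 block hashes, the hypothetical `Replica` class, and the "longest cached prefix first, then least load" rule are all assumptions made for illustration. The idea is to send each request to the replica that already holds the longest cached prefix of its prompt, so that replica can skip recomputing those KV blocks.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical number of tokens per KV-cache block


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block, chaining in the previous hash so that a block's
    hash identifies the entire prefix up to and including that block."""
    hashes: list[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = ",".join(map(str, tokens[start:start + block_size]))
        prev = hashlib.sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    """Hypothetical view of one inference replica, as seen by the router."""
    name: str
    load: int = 0  # e.g. number of in-flight requests
    cached: set[str] = field(default_factory=set)  # block hashes held in its KV cache


def matched_blocks(replica: Replica, hashes: list[str]) -> int:
    """Count how many leading blocks of the prompt this replica already caches."""
    n = 0
    for h in hashes:
        if h not in replica.cached:
            break
        n += 1
    return n


def route(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Pick the replica with the longest cached prefix; break ties by lower load."""
    hashes = block_hashes(tokens)
    best = max(replicas, key=lambda r: (matched_blocks(r, hashes), -r.load))
    best.cached.update(hashes)  # assume the chosen replica now holds these blocks
    best.load += 1
    return best


a, b = Replica("a"), Replica("b")
system_prompt = list(range(8))  # a shared 8-token prefix = two 4-token blocks
print(route([a, b], system_prompt + [100, 101, 102, 103]).name)  # a (tie, first wins)
print(route([a, b], system_prompt + [200, 201, 202, 203]).name)  # a (2 cached blocks)
print(route([a, b], [7, 7, 7, 7]).name)                          # b (no match, less load)
```

The second request reuses the shared system prompt, so it lands on the replica that already cached it even though that replica is busier; an unrelated prompt falls back to the less-loaded replica. A production router would also weigh queue depth and track cache eviction, which this sketch ignores.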



Kuntai Du @this_will_echo

about 1 month ago

Free yes, only in April no

Zo Computer

@zocomputer

about 2 months ago

Try the top models for free in April


Kuntai Du @this_will_echo

about 1 month ago

Key trick: don't use existing interactions directly ---- try to generate "lessons" from previous interactions instead.

Google Research

@GoogleResearch

about 1 month ago

ReasoningBank, a novel agent memory framework, enables LLM agents to continuously learn from both successful & failed experiences. Our evaluation shows that it enhances agent effectiveness, boosting success rates and efficiency. Learn more: https://t.co/lHlYzeKMcm



Kuntai Du @this_will_echo

about 1 month ago

LLM providers start to take control...

Tensormesh

@tensormesh

about 1 month ago

A company with 60+ accounts just had its entire AI infrastructure taken offline by their provider. No reason given, all that was provided was an appeal path as a Google Form. This is not a one-off, we have mapped the pattern across every major closed-weight provider and what enterprise teams can do about it. 📖 Read the full blog: https://t.co/NHjezy9ZpY 🚀 Try Tensormesh with $100 in free GPU Credits: https://t.co/szVTe4pk5k



Kuntai Du @this_will_echo

about 2 months ago

My time being spent:
before using Claude Code --> write code
after using Claude Code --> read code, understand it, and find potential issues
My mental effort is not getting much lighter lol.


Kuntai Du @this_will_echo

about 2 months ago

I want Stardew Valley in my IDE 😝

Aman

@Amank1412

about 2 months ago

Someone built a transparent Mario game that runs OVER the IDE so you can play while waiting for Copilot to write code.


Kuntai Du @this_will_echo

about 2 months ago

Heard that Qwen close-sourced their best model 😈

Qwen

@Alibaba_Qwen

about 2 months ago

⚡ Meet Qwen3.6-35B-A3B：Now Open-Source！🚀🚀

A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.

🔥 Agentic coding on par with models 10x its active size
📷 Strong multimodal perception and reasoning ability
🧠 Multimodal thinking + non-thinking modes

Efficient. Powerful. Versatile. Try it now👇

Blog：https://t.co/EXx5y466su
Qwen Studio：https://t.co/bg4tAU1p74
HuggingFace：https://t.co/w4pDX14DZS
ModelScope：https://t.co/SuRyLzdQiO
API（‘Qwen3.6-Flash’ on Model Studio）：Coming soon～ Stay tuned



Kuntai Du @this_will_echo

about 2 months ago

Using an LLM for reranking pushes Terminal Bench SOTA by so much!

Azalia Mirhoseini

@Azaliamirh

about 2 months ago

Turns out we can get SOTA on agentic benchmarks with a simple test-time method!

Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:

1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score

You can get a verification score in a single sampling pass per candidate pair.

Blog: https://t.co/jYPZUgncLe
Code: https://t.co/caBpzd3Xkx

Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
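Step 2️⃣ of this thread (turning the log-probs of rank tokens into an expected score) can be made concrete with a small sketch. It assumes the serving API returns top log-probabilities at the position where the model writes its rating, that a higher rank means a better result, that each rank is a single token (safe for k ≤ 9), and that probabilities are renormalized over the rank tokens "1".."k" only. The prompt wording and the pairwise comparison described in the tweet are not reproduced, and `expected_rank_score` is a hypothetical name.

```python
import math


def expected_rank_score(rank_logprobs: dict[str, float], k: int) -> float:
    """Return E[rank] = sum_r r * p(r) over the rank tokens "1".."k".

    rank_logprobs maps candidate next tokens (as strings) to their log-probabilities
    at the rating position. Non-rank tokens are ignored, and the remaining
    probabilities are renormalised so they sum to 1 (an assumption of this sketch).
    """
    ranks = [r for r in range(1, k + 1) if str(r) in rank_logprobs]
    if not ranks:
        raise ValueError("no rank tokens among the returned log-probs")
    # Subtract the largest log-prob before exponentiating, for numerical stability.
    top = max(rank_logprobs[str(r)] for r in ranks)
    weights = {r: math.exp(rank_logprobs[str(r)] - top) for r in ranks}
    total = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / total


# Hypothetical log-probs for two candidates rated on a 1-3 scale.
cand_a = {"1": math.log(0.25), "2": math.log(0.25), "3": math.log(0.5)}
cand_b = {"1": math.log(0.1), "2": math.log(0.1), "3": math.log(0.7), "\n": math.log(0.1)}
scores = {"a": expected_rank_score(cand_a, 3), "b": expected_rank_score(cand_b, 3)}
best = max(scores, key=scores.get)
print(scores, best)
```

Compared with keeping only the most likely rank, the expectation separates candidates that share the same top rank but differ in confidence: both candidates above would get a plain "3", while their expected scores are 2.25 and about 2.67.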



Kuntai Du @this_will_echo

about 2 months ago

The latest models use efficient attention mechanisms like Mamba or sliding-window attention. This opens huge potential for the KV cache offloading layer --- LMCache needs to catch up.

Tensormesh

@tensormesh

about 2 months ago

GPU memory alone won’t carry the next generation of LLM serving. At #RaySummit, our Chief Scientist @this_will_echo shared how #LMCache offloads KV Cache across CPU RAM, local disk, Redis, and S3, while enabling cache reuse beyond basic prefix caching. Watch the full talk on YouTube: 👉🏻https://t.co/89qjddXbT1 #RaySummit #LMCache #Tensormesh #KVCache
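The tiered offloading described in this talk (CPU RAM, local disk, Redis, S3) can be pictured as a chain of LRU tiers. The sketch below is a toy model, not LMCache's API or eviction policy: each tier is an in-memory `OrderedDict` standing in for a real backend, capacity is counted in entries rather than bytes, and the promote-on-hit / demote-on-evict rules are assumptions chosen for illustration. `Tier` and `TieredKVStore` are hypothetical names.

```python
from collections import OrderedDict


class Tier:
    """One storage tier with LRU eviction; capacity is counted in entries for simplicity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.data = OrderedDict()  # key -> KV bytes, least recently used first

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        """Insert a value and return the evicted (key, value) pair, if any."""
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            return self.data.popitem(last=False)
        return None


class TieredKVStore:
    """Lookups walk tiers fastest-first. Hits in slower tiers are promoted to the
    fastest tier; entries evicted from a tier are demoted to the next one and
    dropped after the last tier."""

    def __init__(self, tiers):
        self.tiers = tiers

    def put(self, key, value, level=0):
        evicted = self.tiers[level].put(key, value)
        if evicted is not None and level + 1 < len(self.tiers):
            self.put(*evicted, level=level + 1)

    def get(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if i > 0:
                    del tier.data[key]
                    self.put(key, value)  # promote to the fastest tier
                return value, tier.name
        return None, None


store = TieredKVStore([Tier("cpu_ram", 2), Tier("local_disk", 2), Tier("remote", 4)])
for i in range(5):
    store.put(f"chunk{i}", b"kv")
print(store.get("chunk0"))  # (b'kv', 'remote')
print(store.get("chunk0"))  # (b'kv', 'cpu_ram')
```

After five inserts into a 2/2/4-entry hierarchy, the oldest chunk has been demoted to the slowest tier; reading it promotes it back to CPU RAM, which pushes the least recently used entries down a level. A real KV cache would key entries by a hash of the token prefix and store tensors rather than placeholder bytes.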


Kuntai Du @this_will_echo

about 2 months ago

Lol benchmaxxing, sooo true

Junyang Lin

@JustinLin610

about 2 months ago

we need agent evals that are really consistent with real world usages. otherwise people are optimizing foundation models for the wrong direction. the problem of targeting is even bigger than benchmaxxing.


Kuntai Du @this_will_echo

about 2 months ago

Two years ago, we only had 2 NVIDIA A40s. Two years later, our project is mentioned in Jensen Huang's GTC talk. Hope is the first-order weapon for humans to fight for the future.

Hanchen Li

@lihanc02

3 months ago

Some former colleagues from @lmcache shared this photo from the GTC Keynote. I am honestly surprised how fast the team has been growing. (We were a research lab on 2 A40 GPUs in 2023!) btw I think they are hiring LLM hackers (or product hackers I am not sure 🤪, you should just check with @JunchenJiang @ChengYihuaA) #GTC #LLM #Inference #Nvidia #LMCache #KVCache



this_will_echo retweeted

LMCache Lab

@lmcache

2 months ago

Why not store KV cache permanently?

In case you missed it, #IBM recently posted two blogs for 𝗹𝗹𝗺-𝗱 + 𝗞𝟴𝗦 + 𝗟𝗠𝗖𝗮𝗰𝗵𝗲-based KV storage.

Avoiding recomputation is the goal, but it’s still rare to see KV cache treated as shared, persistent infrastructure in real production deployments. Excited to see LMCache be part of this with IBM, a long-time collaborator of the LMCache community. Thrilled to keep building together.

These two posts are a great look at what that can actually look like in practice:
1. Rethinking LLM Inference Economics with llm-d, LMCache, and IBM Storage Scale https://t.co/saHl7y9ujI
2. Deploying Distributed LLM Inference Service with IBM Storage Scale for KV Cache Offloading https://t.co/UNl4MmvAYB

Great read for anyone interested in fast yet cheap LLM inference.

#LMCache #vLLM #Kubernetes #K8s #KVCache


Kuntai Du @this_will_echo

2 months ago

Physical LLM is on the way lol

Tensormesh

@tensormesh

2 months ago

"𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸" — Kevin Deierling, SVP Networking #NVIDIA At his #GTC talk last week, he highlighted 𝗖𝗠𝗫 and 𝗖𝗮𝗰𝗵𝗲𝗕𝗹𝗲𝗻𝗱 from 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 (@tensormesh) were part of the new KV Cache memory stack for agents, and recognized @tensormesh among the 𝗖𝗠𝗫 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀. As the stack evolves, @tensormesh keeps building for what's next. ▶️ session Replay: https://t.co/1UL4OspKsG



this_will_echo retweeted

Tensormesh

@tensormesh

3 months ago

🔴 Live from #GTC2026

On the floor with our Chief Scientist @this_will_echo and CTO #Yihua Chang. #KVCache is the hottest topic of the day. Even Jensen opened with it.

🎙️They covered topics like #CacheBlend, @lmcache 0.4.0, and the super cool collab with @nvidia around a bot called #reachy using LMCache under the hood for 20x speedup.

#GTC2026 #KVCache #LMCache #TensorMesh


Kuntai Du @this_will_echo

5 months ago

By offloading KV caches to SSD, we managed to reduce the time-to-first-token for @gmi_cloud without ANY extra infra cost!

GMI Cloud

@gmi_cloud

5 months ago

Happy 2026 🥂 First post of the year: a technical benchmark.

In a joint study with @tensormesh, we achieved:
- 4× TTFT improvement
- Prefix cache hit rate >50%

Using SSD-augmented KVCache on realistic multi-turn LLM traffic.

Full write-up on GMI Cloud: https://t.co/NALnwU01ke


this_will_echo retweeted

Junchen Jiang

@JunchenJiang

6 months ago

🚀 LMCache has officially been out for 1.5 years now! Within its success, LMCache has become the default KV-cache library for open-source LLM inference (CPU offload, P2P sharing, multi-backend storage, vLLM/SGLang integration, and more). As a PyTorch Foundation Ecosystem project, LMCache is now used by enterprise leaders across the industry (GKE, AWS, Nvidia's Dynamo, llm-d…). 🤔What’s the secret to our product?? 🔎 Come see yourself: https://t.co/oE3SfgXpWC ♥️ A huge thank you to our contributors and community, you’ve influenced what makes LMCache today. (@lmcache) #KVCache #LMCache #LLM #vLLM



Kuntai Du @this_will_echo

7 months ago

GitHub is not acting normal... Our LMCache logo suddenly disappeared today, and we didn't make any change. We cannot even clone the repo using SSH. GitHub bad bad.
