With RAG and agents becoming ubiquitous in LLM systems, tuning quality and performance JOINTLY is essential to achieve the best LLM quality-of-experience.
Our paper at SOSP this year addresses this exact tradeoff! 🔥
🚨 LMCache now turbocharges multimodal models in vLLM!
By caching image-token KV pairs, repeated images now get a ~100% cache hit rate, cutting latency from 18s to ~1s.
Works out of the box.
Check the blog: https://t.co/WUiCF7adRN
Try it 👉 https://t.co/JaKbQCXFd3
#vLLM #MLLM #AIinfra #LMCache
Our open-source LLM cluster deployment solution is 10x faster than the SOTA OSS solution. Check out the vLLM Production Stack! 🤩🤩🤩
Since Jan 2025, vLLM Production Stack has been the reference open-source vLLM inference cluster solution, with advanced KV cache offloading and K8s-native support. Today, our benchmarks show:
✅ 10x better performance than the SOTA OSS solution (AIBrix) in multi-turn chat
✅ Better stability after setup
Reproduce it yourself:
📝Blog post and benchmark: https://t.co/S2SJNmKfnC
🔗Github repo: https://t.co/6RSJK1AZJx
📺30s demo: https://t.co/T2PHba6Yqu
#vLLM #LLM #GenAI #OpenSource #Inference #AI
🚀 The LMCache docs website is now live! 🎉
Whether you're new to LLMs or a pro, our docs cover your needs!
📚 Getting Started guides
🔍 Small examples
👨‍💻 Code documentation
Boost your LLM deployment today!
Check our blogpost!
https://t.co/rZok91jWKy