LLM inference is too slow, too expensive, and too hard to scale.
🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute.
Here’s how it works—and how you can use it today:
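One way to picture the "smart scheduling" piece is prefix-cache-aware routing. A request goes to the vLLM replica that already holds the longest matching prompt prefix in its KV cache, and replicas that are already busy are penalized. This is a hypothetical sketch under those assumptions. The `Replica` type, `pick_replica`, and the `load_penalty` weight are invented here, and matching is done per character rather than per KV-cache block. It is not llm-d's actual scheduler or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical view of one vLLM replica, as a router might track it."""
    name: str
    cached_prefixes: List[str] = field(default_factory=list)  # prompts whose KV cache is resident
    active_requests: int = 0  # requests currently in flight


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_hit(replica: Replica, prompt: str) -> int:
    """Longest prefix of `prompt` this replica could reuse from its cache."""
    return max((common_prefix_len(p, prompt) for p in replica.cached_prefixes), default=0)


def pick_replica(replicas: List[Replica], prompt: str, load_penalty: int = 8) -> Replica:
    """Score = reusable prefix length - load_penalty * in-flight requests.

    More cache reuse means less prefill work. The load term (a hypothetical
    knob) keeps one popular replica from absorbing every request.
    """
    return max(replicas, key=lambda r: cache_hit(r, prompt) - load_penalty * r.active_requests)


system = "You are a helpful assistant. "
replicas = [
    Replica("a", cached_prefixes=[system], active_requests=1),
    Replica("b"),
]
print(pick_replica(replicas, system + "Summarize this ticket.").name)  # prints "a"
```

Replica "a" wins here because its 29 reusable characters outweigh one in-flight request. Raise its load to four requests and the score flips to "b". That trade-off between cache reuse and load balance is the design choice the penalty encodes.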
🇬🇧 London, June 10.
@vllm_project & @_llm_d_ Inference Meetup, hosted by Red Hat AI, @nvidia, and @SteliaAI.
Talks on vLLM updates, speculative decoding, llm-d in production, AI safety, and more.
Plus food, drinks, and the people building this stuff. https://t.co/QC4d1yKPbc
Serving Llama 70B as a cloud endpoint costs far more than serving Llama 8B.
For teams where a smaller model meets the quality bar, that gap is hard to ignore. And with INT4 quantization: 4x smaller, 2x faster, less than 1% accuracy loss.
The right model isn't always the biggest one.
https://t.co/23IHcSmDkk
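The "4x smaller" figure follows from bit width alone if the baseline is 16-bit weights (16 / 4 = 4). This back-of-envelope sketch assumes FP16/BF16 as the baseline and rounds the parameter counts to 8B and 70B. It counts weights only. The speed and accuracy numbers in the post come from measurement and are not derived here.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts only the weights themselves: quantization scales/zero points,
    KV cache, and activations are ignored.
    """
    return params_billion * bits_per_weight / 8  # billions of params * bytes per param = GB


for name, params in [("Llama 8B", 8), ("Llama 70B", 70)]:
    fp16 = weight_gb(params, 16)  # assumed 16-bit (FP16/BF16) baseline
    int4 = weight_gb(params, 4)
    print(f"{name}: {fp16:.0f} GB at 16-bit vs {int4:.1f} GB at INT4 ({fp16 / int4:.0f}x smaller)")
```

For 8B that is roughly 16 GB vs 4 GB, and for 70B roughly 140 GB vs 35 GB. Even quantized, the 70B weights are still larger than the unquantized 8B weights.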
Calling Boston area startups building with AI. 🤙
We're kicking off 2026 with the first event in a new monthly, in-person hackathon series hosted by @RedHat and @IBM in Boston’s Seaport District.
This one-day hackathon is designed specifically for local startups that want to move faster from idea to working prototype.
Instead of a fixed theme, you bring a real AI problem your team is actively facing. We help you build a proof of concept using open-source, enterprise-ready templates from https://t.co/byiy4bdAZa, including MCP Server, AI Agent, and UI templates.
What you will get:
⚡ Rapid prototyping without boilerplate
🧠 Hands-on guidance from Red Hat AI architects
🤝 Connections with other Boston-based AI startups and ecosystem partners
If you are a Boston startup looking to turn an AI challenge into something real, this is for you.
Event details are shared after registration.
Register now: https://t.co/K9WJmqfBjt
The @RedHat_AI team contributes a lot to vLLM and does amazing work for the open-source community. Great to see vLLM performing so well compared to TRT-LLM on H200! vLLM also comes pretty close on B200, with the @NVIDIAAI team working on closing the gap for gpt-oss within the next couple of updates.
InferenceMAX, vLLM TPU, compressed-tensors, MoE support via transformers, DeepSeek-OCR, and more.
Here’s what’s new in the @vllm_project community over the past two weeks:
4 tracks. 12 sessions. 1 day of learning.
Join us on Oct. 16 for Red Hat AI Day of Learning, a free virtual event for developers, engineers & practitioners.
Tracks:
⚡ Fast & efficient inference
🎯 Model customization
🤖 Agentic AI
🌐 Scaling AI over hybrid cloud
Sessions include:
· Intro to vLLM and how to get started
· Model optimization with LLM Compressor
· Lossless LLM inference acceleration w/ Speculators
· End-to-end model customization
· Synthetic data generation and data processing
· Continual learning of LLMs with Training Hub
· Build open source agentic AI solutions
· Intro to Model Context Protocol (MCP)
· Intro to Llama Stack
· Intro to distributed inference
· Distributed inference with llm-d
· Scaling AI Infrastructure
👉 Register free: https://t.co/47t6ts4A4c
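The "lossless" in the Speculators session title refers to speculative decoding, which is also on the London meetup agenda. A small draft model guesses several tokens, and the large target model keeps only the tokens it would have produced itself. Quality is unchanged, and the speedup comes from verifying several tokens per target pass. This is a minimal greedy sketch with toy deterministic models. `target`, `draft`, and all function names are hypothetical. It is not the Speculators or vLLM API, and it does not cover the sampling-based acceptance that real engines also handle.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "model": given the tokens so far, return the greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target step per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification_rounds).

    Every token kept is the target's own greedy choice, so the output is
    identical to greedy_decode(target, prompt, n_new).
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(min(k, goal - len(seq))):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # 2. The target checks each position (one batched pass in a real engine).
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)  # keep the target's token either way
            if expected != tok:
                break  # first mismatch: drop the rest of the draft
        else:
            # All drafts accepted: the same target pass yields one bonus token.
            if len(seq) < goal:
                seq.append(target(seq))
    return seq, rounds


def target(seq: List[int]) -> int:
    return (sum(seq) * 31 + 7) % 50


def draft(seq: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    return target(seq) if len(seq) % 5 else (target(seq) + 1) % 50


tokens, rounds = speculative_decode(target, draft, [1, 2, 3], n_new=20)
assert tokens == greedy_decode(target, [1, 2, 3], n_new=20)
print(f"20 tokens in {rounds} verification rounds")
```

The assertion captures the "lossless" property. A bad draft only costs speed, because every kept token is still the target's choice.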
Qwen3-Next dropped yesterday and you can run it with Red Hat AI today.
✅ Day-zero support in vLLM
✅ Day-one deployment with Red Hat AI
Step-by-step guide: https://t.co/ZjLJyfmMJm
The future of AI is open.
Thanks to the @lmcache team for joining forces with Red Hat on llm-d!
llm-d is a new open source project for scalable, efficient distributed LLM inference with @vllm_project.
Learn more about our collaboration here: https://t.co/bGLTJoHGmi
@RedHat_AI Adding a shoutout to the @IBMResearch team, working jointly with the AMD team, for contributing Triton attention kernels to vLLM v1 that improved decode throughput by 3x on Llama and Granite models.
Really excited to see the emergence of llm-d, @addvin! Inference is the biggest workload in human history, and the open source tools need to keep evolving to serve it.
The llm-d project is a major step forward for the #opensource AI ecosystem, and we are proud to be one of the founding contributors, reflecting our commitment to collaboration as a catalyst for innovation in generative AI.
As generative and agentic AI continue to evolve, scalable, high-performance inference will be critical to unlocking their full potential.
That’s why we’re partnering with @RedHat and other contributors to grow the llm-d community and accelerate its capabilities—powered by our contributions, including innovations from NVIDIA Dynamo such as NIXL.
🔗 Explore and contribute on GitHub: https://t.co/U7OgK2PgMl
📰 Read the launch blog: https://t.co/u8Nyhxj2w2
🎙️ Hear from NVIDIA’s VP of Engineering & AI Frameworks, Ujval Kapasi → https://t.co/cfA8hlTWeT
DeepSeek’s Open Source Week drops A LOT of exciting goodies! We’re hosting vLLM Office Hours tomorrow—learn what they are, how they integrate with vLLM, & ask questions!
Date: Thursday, Feb 27
Time: 2PM ET / 11AM PT
Register: https://t.co/zTjNvaFusp #DeepSeek #AI
At @RedHat, we believe the future of AI is open. That's why I'm incredibly excited about our acquisition of @NeuralMagic. Together, we're furthering our commitment to our customers and the open source community to deliver on the future of AI—and that starts today.
Today, Red Hat completed the acquisition of @NeuralMagic, a pioneer in software and algorithms that accelerate #GenAI inference workloads. Read how we are accelerating our vision for #AI’s future: https://t.co/PkGfC48tAt.
If you are at #NeurIPS2024 this week, stop by the Neural Magic booth #307 and talk to us about the @vllm_project! vLLM core committer @mgoin_ will be there, ready to hear your ideas and share them with the team. The best feature requests always come from in-person chats!
For our last seminar of the year, Lucas Wilkinson from @neuralmagic will be presenting!
Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs
Time: Dec 4, 3pm EST. Sign up via https://t.co/EvbCJnxpr8 to join our mailing list for the Zoom link.
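A mixed-input GEMM multiplies activations in one precision by weights stored in another, for example float activations times INT4 weights. This plain-Python sketch covers the arithmetic only. It assumes symmetric per-row scales and low-nibble-first packing, both chosen for illustration. Machete's real weight layout, group sizes, and API are not shown, and all names below are hypothetical.

```python
from typing import List, Tuple


def pack_int4(values: List[int]) -> bytes:
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: sign-extend each nibble back to -8..7."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def quantize_row(w_row: List[float]) -> Tuple[bytes, float]:
    """Symmetric INT4 quantization of one weight row with a single scale."""
    scale = max(abs(v) for v in w_row) / 7 or 1.0  # avoid a zero scale for an all-zero row
    q = [max(-8, min(7, round(v / scale))) for v in w_row]
    return pack_int4(q), scale


def mixed_input_gemm(x: List[List[float]], w_packed: List[bytes],
                     scales: List[float]) -> List[List[float]]:
    """Y[m][n] = sum_k X[m][k] * scale[n] * Q[n][k].

    X holds float activations (M x K); weight row n is packed INT4 with its
    own scale. Each row is dequantized just before use; a mixed-input GPU
    kernel does this conversion inside the GEMM instead of writing float
    weights back to memory.
    """
    y = [[0.0] * len(w_packed) for _ in x]
    for n, (packed_row, s) in enumerate(zip(w_packed, scales)):
        w_row = [s * q for q in unpack_int4(packed_row)]  # dequantize row n
        for m, x_row in enumerate(x):
            y[m][n] = sum(a * b for a, b in zip(x_row, w_row))
    return y


weights = [[7.0, -7.0, 0.0, 3.0], [14.0, 0.0, -14.0, 2.0]]  # N=2 rows, K=4
packed, scales = zip(*(quantize_row(r) for r in weights))
print(mixed_input_gemm([[1.0, 2.0, 3.0, 4.0]], list(packed), list(scales)))  # [[5.0, -20.0]]
```

The example weights were picked so they quantize exactly. With arbitrary weights, each dequantized value can be off by up to half a scale step.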