vLLM @vLLM_Project - Twitter Profile

about 9 hours ago

🎉 The vLLM community just got a free course, built by @RedHat_AI with @DeepLearningAI. It walks through the full optimize → deploy → benchmark lifecycle for serving open models. Three labs, each on a live vLLM server: - Compress: quantize a Qwen model with LLM Compressor, then measure the size vs. accuracy tradeoff - Serve: deploy with vLLM's OpenAI-compatible API and watch continuous batching, PagedAttention, and prefix caching in the live metrics - Benchmark: simulate traffic with GuideLLM and check quality with lm-eval A lot of the work went into visualizing what actually happens under inference, thanks to @cedricclyburn: how tokens flow through the model, how the KV cache grows in GPU memory, and what changes when you move from FP16 to INT8/INT4. ~1.5 hours, 9 lessons, 3 labs. Free on https://t.co/pGAwqbmMdc. 📝 Read more: https://t.co/r8ITc2prI2

DeepLearning.AI

@DeepLearningAI

about 19 hours ago

New short course: Fast & Efficient LLM Inference with vLLM, built in partnership with @RedHat and taught by @cedricclyburn. Learn to quantize an open-source LLM, serve it with vLLM, and benchmark your deployment across speed, cost, and accuracy. Free to enroll: https://t.co/czVwJBnLZ6

10

244

49

171

26K

7

139

16

94

10K

vLLM

@vllm_project

about 17 hours ago

Congrats to the @googlegemma team on the Gemma 4 12B launch 🎉 Day-0 support on vLLM is ready to go. It's an encoder-free unified multimodal model — text, image, audio, and video all project straight into the LLM's embedding space, no separate vision or audio towers. 256K context, built-in thinking, native tool calling. Reasoning + tool parsers (`gemma4`), vision, and audio all served through the OpenAI-compatible API. 🔗 Recipe: https://t.co/MGJcoQkwzz

Google Gemma

@googlegemma

about 19 hours ago

Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇

googlegemma's tweet photo. Meet Gemma 4 12B!

A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.

Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇 https://t.co/gf4FZv0WZb

323

11K

1K

4K

2M

8

311

29

77

14K

vLLM

@vllm_project

1 day ago

Intel's AutoRound post-training quantization is now integrated natively into vLLM-Omni, bringing W4A16 (4-bit) to Omni multimodal, diffusion image, and video models. It cuts Qwen3-Omni-30B from 66GB to 25GB with no quality cliff. Quantize once offline, then serve with the same command as the BF16 model. 📈 Highlights: - weights shrink to ~1/4 of BF16, dropping FLUX.1-dev from 4 GPUs (TP=4) to a single one - accuracy held on OmniBench, with ~1.3% drift on text-to-image - on Intel XPU B60, the freed memory unlocks CFG Parallel for 1.55-1.67x faster guided diffusion 🙌 Thanks to @IntelAI, @HaihaoShen, and the vLLM-Omni team for the collaboration! 🔗 https://t.co/LRvkwAnNcL

vllm_project's tweet photo. Intel's AutoRound post-training quantization is now integrated natively into vLLM-Omni, bringing W4A16 (4-bit) to Omni multimodal, diffusion image, and video models. It cuts Qwen3-Omni-30B from 66GB to 25GB with no quality cliff. Quantize once offline, then serve with the same command as the BF16 model.

📈 Highlights:
- weights shrink to ~1/4 of BF16, dropping FLUX.1-dev from 4 GPUs (TP=4) to a single one
- accuracy held on OmniBench, with ~1.3% drift on text-to-image
- on Intel XPU B60, the freed memory unlocks CFG Parallel for 1.55-1.67x faster guided diffusion

🙌 Thanks to @IntelAI, @HaihaoShen, and the vLLM-Omni team for the collaboration!

🔗 https://t.co/LRvkwAnNcL

4

91

11

47

10K

vllm_project retweeted

Matej Sirovatka

@m_sirovatka

1 day ago

KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀

14

330

24

134

28K

vLLM

@vllm_project

2 days ago

🎉 MiniCPM-o 4.5 has been merged into vLLM-Omni 🎙️ It's a 9B end-to-end omnimodal model: text / image / audio / video in, text + speech out. This first integration brings simplex, non-streaming serving, with duplex streaming up next. 🚀 `vllm serve openbmb/MiniCPM-o-4_5 --omni --trust-remote-code` Thanks to the @OpenBMB and vLLM-Omni team for the collaboration 🙌 🔗 Recipe: https://t.co/Qp6M74Gdib

OpenBMB

@OpenBMB

2 days ago

🎉 Exciting News! MiniCPM-o 4.5 has officially been merged into vLLM-Omni! 👏 💡 Currently, it supports simplex non-streaming usage—a solid first step for what's to come. 🚀 The journey continues! We will keep pushing forward with the amazing open-source community to integrate duplex streaming capabilities next. Get ready for an even more powerful and seamless interactive experience. Stay tuned! 💪 https://t.co/tbkwuHoPmv https://t.co/bRcld16Jco

OpenBMB's tweet photo. 🎉 Exciting News! MiniCPM-o 4.5 has officially been merged into vLLM-Omni! 👏
💡 Currently, it supports simplex non-streaming usage—a solid first step for what's to come.
🚀 The journey continues! We will keep pushing forward with the amazing open-source community to integrate duplex streaming capabilities next.
Get ready for an even more powerful and seamless interactive experience. Stay tuned! 💪
https://t.co/tbkwuHoPmv
https://t.co/bRcld16Jco

7

56

5

21

14K

4

106

6

33

14K

vLLM

@vllm_project

2 days ago

🎉 Congrats to @JetBrains on Mellum2-12B-A2.5B-Thinking, an open-source 12B MoE that activates just 2.5B params, handling both natural language and code with a 128K context. Mellum2 runs natively in vLLM from day 0, with reasoning parser and tool calling for agentic workflows. 🔗 https://t.co/72E6HDOHNf

vllm_project's tweet photo. 🎉 Congrats to @JetBrains on Mellum2-12B-A2.5B-Thinking, an open-source 12B MoE that activates just 2.5B params, handling both natural language and code with a 128K context.

Mellum2 runs natively in vLLM from day 0, with reasoning parser and tool calling for agentic workflows.

🔗 https://t.co/72E6HDOHNf

JetBrains @jetbrains

3 days ago

Mellum started with code completion. Mellum2 is built for more – handling both natural language and code. A 12B-parameter open-source LLM for routing, RAG, and sub-agents, optimized for ultra-low-latency inference. Now on @huggingface. Learn more: https://t.co/28sG8Ql52L

24

328

64

147

96K

22

454

42

188

42K

vLLM

@vllm_project

3 days ago

🚀 We're excited to partner with @NVIDIARTXSpark pushing local AI agents forward on DGX Spark + RTX! This is exactly the direction explored in @Inferact's hands-on #vLLM + #DGXSpark blog—serving large NVFP4 models locally on NVIDIA DGX Spark. vLLM is an ideal fit, bringing streaming responses, paged KV cache, runtime tuning, and Prometheus metrics into a familiar serving workflow. To put it to the test, the team built a live 20 Questions game—using DGX Spark as the local inference endpoint and vLLM telemetry to track real serving behavior. Key takeaways: 🧠 DGX Spark brings large-model inference closer to developers ⚙️ Unified memory makes serving config especially important 📦 NVFP4 MoE models are a strong fit for Spark's local inference profile 📊 vLLM gives developers the tools and metrics to understand real serving behavior Try it on your own Spark 👇 https://t.co/bWkBfpKf3M

NVIDIA RTX Spark

@NVIDIARTXSpark

3 days ago

Local AI Agents are leveling up across DGX Spark & RTX PCs. NVIDIA OpenShell is coming to Windows alongside new agentic AI optimizations and creator app updates — including NVIDIA Broadcast 2.2, plus upcoming RTX acceleration for Adobe apps and Blender. More 👇

NVIDIARTXSpark's tweet photo. Local AI Agents are leveling up across DGX Spark & RTX PCs.

NVIDIA OpenShell is coming to Windows alongside new agentic AI optimizations and creator app updates — including NVIDIA Broadcast 2.2, plus upcoming RTX acceleration for Adobe apps and Blender.

More 👇 https://t.co/GroTth3TaM

6

255

22

27

33K

3

28

1

4

9K

vLLM

@vllm_project

3 days ago

🚀 Excited to partner with @NVIDIAAI on day-0 support for Cosmos 3 on vLLM-Omni! A unified Mixture-of-Transformers fusing an AR reasoner + diffusion generator across text, image, video, audio & robot action - all behind a single OpenAI-compatible API, with a ready-to-deploy Docker image! 📖Check out the detailed deployment guide👇 https://t.co/mhLvsZXB4i

NVIDIA AI

@NVIDIAAI

3 days ago

Introducing Cosmos 3: Our latest frontier model for Physical AI Cosmos 3 is the world’s first fully open omnimodel with native vision reasoning, world and action generation. Today we’re releasing Super (32B) and Nano (8B) variants.

94

3K

404

1K

346K

10

245

34

59

31K

vLLM

@vllm_project

4 days ago

🚀Great to see @RedHat_AI and @poolsideai team up to make Laguna XS.2 faster and cheaper to serve in vLLM. A DFlash speculator built with Speculators drafts 8 tokens per forward pass for 2-3x faster decoding at no quality loss, and LLM Compressor adds FP8 / NVFP4 / INT4 checkpoints so you can match your hardware budget. 🔗 https://t.co/3mguMKtwTt

Red Hat AI

@RedHat_AI

5 days ago

Laguna XS.2 from @poolsideai is a 33B MoE built for agentic coding. Red Hat AI trained a DFlash speculator for it: 0.6B drafter, 8 tokens per pass, no quality loss. FP8, NVFP4, and INT4 checkpoints via LLM Compressor. Models in comments. Speedup with @vllm_project:

6

56

13

23

20K

5

106

14

28

21K

vLLM

@vllm_project

5 days ago

Models, serving, and what to know before upgrading: 🆕 New architectures: MiniCPM-V 4.6, InternS2 Preview, OpenVLA 🦅 Spec decode: custom callable proposers; Gemma3/Gemma4 multi-GPU fixes + batched vision encoders 🔧 Tool calling improvements across multiple model families ⚠️ Breaking: • Removed deprecated `get_tokenizer` and `resolve_hf_chat_template` locations • Eliminated deprecated MLA prefill arguments • Cleaned up dead CUDA kernels and code 🙏 Thanks to all 230 contributors this cycle (63 new). 📖 Full release notes → https://t.co/fUAdzeCogu

0

13

3

8

3K

vLLM

@vllm_project

5 days ago

vLLM v0.22.0 is out! 459 commits from 230 contributors (63 new). 🎉 Highlights: DeepSeek V4 hardening (NVFP4 fused MoE, full+piecewise CUDA graph, ROCm support), experimental Rust frontend in-Tree, batch-invariant Cutlass FP8 (28.9% lower e2e latency), Model Runner V2 advances, multi-tier KV cache offloading. Thread 👇

vllm_project's tweet photo. vLLM v0.22.0 is out! 459 commits from 230 contributors (63 new). 🎉

Highlights: DeepSeek V4 hardening (NVFP4 fused MoE, full+piecewise CUDA graph, ROCm support), experimental Rust frontend in-Tree, batch-invariant Cutlass FP8 (28.9% lower e2e latency), Model Runner V2 advances, multi-tier KV cache offloading.

Thread 👇

14

506

53

121

54K

vLLM

@vllm_project

5 days ago

Hardware & performance: ⚡ Batch invariance: Cutlass FP8 path → 28.9% end-to-end latency improvement, compile-mode support on SM80, NVFP4 Cutlass linear path 🟢 NVIDIA Blackwell: FlashInfer MoE + FP4 GEMM for SM120/121; per-tensor FP8 CUTLASS on SM12.1 🔴 AMD ROCm: DeepSeek V4 functionality gains, flash sparse MLA Triton kernels, gluon paged MQA logits 💻 CPU / RISC-V: RVV-optimized attention kernels, fused GDN for AMX, experimental Triton + MRv2

1

26

3

5

4K

vLLM

@vllm_project

5 days ago

Great to work with @NVIDIAAI on running @StepFun_ai's Step-3.7-Flash with vLLM on the DGX Station.🤝 Run it locally, or in production as an @NVIDIA NIM container, both powered by vLLM! Serve it now 🚀 🔗 https://t.co/VnMzyXW75R

NVIDIA AI

@NVIDIAAI

6 days ago

You can read more about running Step 3.7 Flash on NVIDIA GPUs here: https://t.co/oNqZH0J6kr

2

48

9

22

21K

3

80

6

24

20K

vllm_project retweeted

Red Hat AI

@RedHat_AI

6 days ago

Speculators v0.5.0 just dropped with 3 big updates: - DFlash training support. Draft all tokens in one pass via block diffusion - Unified online/offline training powered by @vllm_project's hidden states extraction system - Docs & tutorials overhaul for faster onboarding https://t.co/tCMZIrgQf2

0

48

7

12

4K

vLLM

@vllm_project

6 days ago

We've shipped two major upgrades for RL✨! 1. Native weight syncing APIs: Standardizes weight transfer, provides optimized implementations for NCCL and CUDA IPC out of the box, and also lets frameworks easily bring their own. 2. Improved pause/resume for Async RL: Careful coordination between DP ranks so that engines don’t deadlock. Validated at scale in P/D, wide-EP setups! In collaboration with @anyscalecompute, @NovaSkyAI, and @RedHat. More and more RL frameworks are using vLLM as the default for inference, details in the blog 👇 https://t.co/LLmL8zJLtR

vllm_project's tweet photo. We've shipped two major upgrades for RL✨!

1. Native weight syncing APIs: Standardizes weight transfer, provides optimized implementations for NCCL and CUDA IPC out of the box, and also lets frameworks easily bring their own.
2. Improved pause/resume for Async RL: Careful coordination between DP ranks so that engines don’t deadlock. Validated at scale in P/D, wide-EP setups!

In collaboration with @anyscalecompute, @NovaSkyAI, and @RedHat.

More and more RL frameworks are using vLLM as the default for inference, details in the blog 👇

https://t.co/LLmL8zJLtR

2

212

28

73

11K

vLLM

@vllm_project

6 days ago

🎉 Congrats to @StepFun_ai on releasing Step-3.7-Flash, with day-0 support in vLLM. - 198B sparse MoE vision-language model, ~11B active params per token, native image + text input - 256K context window for long docs, multi-file repos, and dense visual interfaces - FP8 and NVFP4 quantized weights ready to go built-in MTP speculative decoding, native tool calling, and reasoning parsing. Serve it now! 🔗 https://t.co/XEqSqFEmVt

vllm_project's tweet photo. 🎉 Congrats to @StepFun_ai on releasing Step-3.7-Flash, with day-0 support in vLLM.

- 198B sparse MoE vision-language model, ~11B active params per token, native image + text input
- 256K context window for long docs, multi-file repos, and dense visual interfaces
- FP8 and NVFP4 quantized weights ready to go

built-in MTP speculative decoding, native tool calling, and reasoning parsing. Serve it now!

🔗 https://t.co/XEqSqFEmVt

StepFun @StepFun_ai

6 days ago

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency. #1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0. Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and reliable execution. - 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels. - Understands UIs, charts, docs, images — then writes code or calls tools to act on what it sees. - Web + visual search reaches further: more sources, deeper follow-up. - Reliable tool use — less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels. - Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP. - Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395. GitHub: https://t.co/kqlZkVIRHv HuggingFace: https://t.co/qqceCrgPiw GGUF: https://t.co/rR6XrnymWG ModelScope: https://t.co/wney6Tzvqy API: https://t.co/RvHWzRG7Fu Blog: https://t.co/BxDiajiQ5G

StepFun_ai's tweet photo. ⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency.

#1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0.

Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and reliable execution.

- 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels.
- Understands UIs, charts, docs, images — then writes code or calls tools to act on what it sees.
- Web + visual search reaches further: more sources, deeper follow-up.
- Reliable tool use — less drift, fewer broken toolcalls. 98%+ on τ²-bench across all difficulty levels.
- Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP.
- Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395.

GitHub: https://t.co/kqlZkVIRHv
HuggingFace: https://t.co/qqceCrgPiw
GGUF: https://t.co/rR6XrnymWG
ModelScope: https://t.co/wney6Tzvqy
API: https://t.co/RvHWzRG7Fu
Blog: https://t.co/BxDiajiQ5G

116

2K

208

620

334K

4

353

34

91

28K

vLLM

@vllm_project

7 days ago

Love seeing vLLM as the engine powering @NVIDIAAI's Dynamo Snapshot. Checkpoint the full vLLM worker (process tree + GPU weights/CUDA context) via cuda-checkpoint + CRIU, then restore with memfd parallel restore in <5s. Great work to the @NVIDIAAI team!

NVIDIA AI

@NVIDIAAI

7 days ago

Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds. In production inference deployments demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests. Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

NVIDIAAI's tweet photo. Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds.

In production inference deployments demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests.

Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

23

358

53

147

61K

1

127

14

43

14K

vllm_project retweeted

Eagle @EagleCorp

8 days ago

Today, EAGLE powers some of the industry’s most formative AI infrastructure companies and teams. With EAGLE 3.1, we’re taking another major step toward delivering a core piece of the fastest possible inference stack that exists, open to all. By improving hidden-state feedback stability and mitigating attention drift across deeper decoding steps, EAGLE 3.1 significantly improves long-context acceptance length and serving robustness in real-world inference environments. We are thrilled to collaborate with vLLM @vllm_project and TorchSpec @lightseekorg on advancing the next generation of inference acceleration infrastructure.

2

66

6

13

9K

vLLM

@vllm_project

Last Seen Users on Sotwe

Trends for you

Most Popular Users