vLLM @vLLM_project - Twitter Profile

4 days ago

Huge milestone from the @anyscalecompute + @googlecloud GKE teams 🎊

Ray Serve LLM provides up to 4.4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads than previous versions.

Three optimizations made this possible on the Ray Serve LLM + vLLM stack:
⭐️ Direct streaming with a control-plane-only endpoint picker
⭐️ A new vLLM Ray V2 executor backend
⭐️ HAProxy ingress for routing at the speed of C

Ray's primitives for fault tolerance, observability, and portability across K8s and VMs are a great foundation as inference deployments get more complex. Congrats to the team!

Try the new Ray V2 executor today in vLLM with --distributed-executor-backend ray.

Seiji Eicher

@seiji_________

4 days ago

Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in benchmarks across a variety of workloads and deployment patterns.

In Ray 2.56, we see up to 4x higher request throughput on prefill-heavy workloads, and 24x higher request throughput on decode-heavy workloads 🎉

vLLM

@vllm_project

4 days ago

🎉 Congrats to @poolsideai on Laguna M.1, a new open-weights agentic coding model. Day-0 support landed in vLLM v0.21.0.

🧠 70-layer sparse MoE: 225B total params, 23B active per token, 256K context
🔀 256 experts with top-k=16 routing, built for long-horizon agentic coding
🛠️ Native interleaved reasoning between tool calls, toggleable per request, Apache 2.0

Recipe 🔗 https://t.co/lDG8poco5g
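
For readers unfamiliar with top-k expert routing, here is a minimal sketch of what "256 experts with top-k=16" means for a single token. Only the two constants come from the post. The function name `route_token`, the random router logits, and the choice to renormalise weights over the selected experts are illustrative assumptions, not Laguna M.1's actual router.

```python
import math
import random

NUM_EXPERTS = 256  # from the post
TOP_K = 16         # from the post


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and softmax-normalise their weights.

    Renormalising over only the selected experts is an assumed convention.
    """
    chosen = sorted(
        range(len(router_logits)),
        key=lambda e: router_logits[e],
        reverse=True,
    )[:top_k]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router output for one token: one logit per expert.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    routing = route_token(logits)
    print(len(routing), "experts selected out of", NUM_EXPERTS)
```

Each token only runs through the selected experts, which is why the active parameter count per token is a small fraction of the total.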

Poolside

@poolsideai

4 days ago

Today we’re releasing the weights for Laguna M.1, our most capable model to date, with a 256K context length. Both base and post-trained checkpoints are now available on Hugging Face under Apache 2.0.

vLLM

@vllm_project

4 days ago

Your coding agent can run on open models you host yourself, not just a hosted API. vLLM serves them fast and cost-efficiently on your own GPUs, with broad hardware support across @NVIDIA, @AMD, and more. It speaks the same OpenAI Responses API that Codex uses, so any compatible agent points right at your server and any tool-calling model is a drop-in replacement. Spin up the latest GLM 5.2 (@Zai_org), Kimi K2.7 Code (@Kimi_Moonshot), or MiniMax M3 (@MiniMax_AI) model, or whatever open model fits your needs, and start coding. 🚀 Guide 🔗 https://t.co/EGNPBtlLB3 Serving Recipe: https://t.co/ftERFfutuf
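
As a rough illustration of "points right at your server", the sketch below builds a Responses-style request aimed at a self-hosted endpoint. The base URL, port, model name, and prompt are placeholders assumed for this example and do not come from the post. The request is constructed but not sent, so it runs without a server.

```python
import json
import urllib.request

# Assumptions for illustration only: local address and model name are placeholders.
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-open-model"  # hypothetical: use the name your server was launched with


def build_responses_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST to the OpenAI-style /responses endpoint."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=f"{base_url}/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_responses_request("Write a function that reverses a string.")
    print(req.get_method(), req.full_url)
    # To send it against a running server, uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Switching between a hosted API and a self-hosted server then comes down to changing the base URL and model name.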

Tibo

@thsottiaux

5 days ago

Reminder that you can use the Codex App, CLI and SDK with any open source model, not just with OpenAI models. https://t.co/spPifB4ck3

vLLM

@vllm_project

4 days ago

Thanks for the kind words! Day 0 @MiniMax_AI M3 support came together thanks to this collaboration in the open. Big kudos to @rogerw0108 and @mgoin_ for the ongoing push, review, and mentorship. More improvements landing soon. 🙌 https://t.co/TbJJ8IUlMw

SemiAnalysis

@SemiAnalysis_

4 days ago

Great work by the @vllm_project team and @NVIDIA on a smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of the ongoing M3 workstream: NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inference (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!

vLLM

@vllm_project

5 days ago

A great deep dive from @SemiAnalysis_ on RL training systems and how much RL efficiency comes down to matching trainer and generator throughput! Shoutout to @KaichaoYou and Ao Shen from @inferact for the sandbox scaling experiments with vLLM + verl, building on @KaichaoYou's early RL integration work across OpenRLHF, verl, and slime🫡

SemiAnalysis

@SemiAnalysis_

6 days ago

RL Systems Mind the Gap: Matching Trainer and Generator Throughput

RL Training Infrastructure, GRPO, PipelineRL, Async RL, Policy Staleness, RL Sandbox Infra, CPU Requirements, TCO Analysis, Thinking Machines Tinker

https://t.co/yr5oH99h4B

vLLM

@vllm_project

5 days ago

🎉 Day-0 support for GLM-5.2 in vLLM, available today in v0.23.0! Congrats to @Zai_org on GLM-5.2, a flagship model built for long-horizon coding agents.

✨ 1M-token context, built to hold project-scale engineering work in a single run
✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization
✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms
✨ Client-side and mobile engineering, including an on-device debugging loop

Try running it on vLLM today: 🔗 https://t.co/tRduouqn6e

Z.ai @Zai_org

6 days ago

Introducing GLM-5.2: Frontier Intelligence, Open Weights

- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1

Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb

vLLM

@vllm_project

6 days ago

Great write-up from the @anyscalecompute team on PD disaggregation with Ray Serve + vLLM! PD Disagg is one of the most difficult techniques to get right in serving; the wins are real, but only in the right settings. Great to see it pressure-tested on AMD MI325X with Ray Serve + vLLM!

kourosh hakhamaneshi

@CyrusHakha

7 days ago

One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced. So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM — where the PD path has been much less paved. 🧵

vLLM

@vllm_project

7 days ago

Models, serving, and what to know before upgrading:

🆕 New models: Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, Cohere Mini Code
🦀 Rust frontend grows up: a streaming generate endpoint, dynamic LoRA endpoints, /version + /server_info, and new tool parsers (InternLM2, Phi-4-mini, Gemma4)
🔒 Security: SSL/TLS for the data-parallel supervisor, and out-of-vocab token IDs rejected before they reach the GPU logprob path

🙏 Thanks to all 200 contributors this cycle (63 first-timers).
📖 Full release notes → https://t.co/lfAyYC0OXm

vLLM

@vllm_project

7 days ago

vLLM v0.23.0 is out! 408 commits from 200 contributors (63 new). 🎉 Highlights: DeepSeek-V4 matures across backends (TRTLLM-gen attention kernel, sparse MLA decoupled from V3.2, EPLB for the Mega-MoE), Model Runner V2 now default for Llama + Mistral dense models, Gemma 4 Unified (encoder-free) + MTP, a maturing Rust frontend, multi-tier KV cache offloading with an object-store tier, and a unified reasoning + tool-call parser. Thread 👇

vLLM

@vllm_project

7 days ago

Hardware & performance:

🟢 NVIDIA: FP8 FlashInfer attention for ViT, Triton MoE on Hopper by default, CUTLASS FP8 scaled-mm padding bypass (+20%), MoE-permute buffer pre-alloc (+9–14%), NUMA auto-binding on DGX B300
🔴 AMD ROCm: ROCm 7.2.3, native W4A16 + fused-MoE W4A16 kernels for RDNA3 (gfx1100), AITER top-k/top-p sampler by default, attention-sink support in AITER FA
🔵 Intel XPU: vllm-xpu-kernel v0.1.7, block FP8 MoE, a DeepSeek-V4 attention decode path, transparent sleep mode
💻 CPU & more: zentorch-accelerated W8A8/W4A16 on AMD Zen CPUs, RISC-V RVV WNA16 helpers, a PowerPC SHM communicator, an arm64 CI image

vLLM

@vllm_project

9 days ago

Glad to see day-0 speculators are warmly welcomed by the community!

SemiAnalysis

@SemiAnalysis_

9 days ago

Congrats to @vllm_project & @lmsysorg for releasing MiniMax M3 428B on both the CUDA & ROCm stack on day 0! MiniMax M3 includes:

🟠 Block sparse attention, with 9x faster prefill than M2.7
🟠 Day 0 open MXFP8 weights
🟠 Furthermore, @Inferact released Day-0 EAGLE3 open-weight draft model support

Excited to try out the performance on MiniMax M3!

vLLM

@vllm_project

10 days ago

Day-0 goes beyond inference: NeMo RL from @NVIDIAAI also supports MiniMax M3 on day 0, with vLLM powering rollout generation. 💡 A reference GRPO recipe is ready, so you can start post-training M3 for your own agentic workflows right away. Branch: https://t.co/UPcMRXRkxP Recipe: https://t.co/frO0BZTH1O

vLLM

@vllm_project

10 days ago

🎉 Congrats to @MiniMax_AI on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model.

At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve.

M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware:
✨ MSA sparse attention with dedicated prefill and decode kernels
✨ 1M-token context serving with prefix caching and chunked prefill
✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell
✨ Native multimodal input (image + video)
✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads

Day-0 support like this is a true team effort. Grateful to the teams at @MiniMax_AI, @NVIDIAAI, @AIatAMD, and @inferact, and to the vLLM community for making it happen. 🙏

Deep dive into the implementation, kernel work, and deployment recipes: 🔗 https://t.co/TbEc9VgqJ7
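
The block-selection idea in the M3 post (score KV blocks, then attend only over the top ones) can be sketched for a single query in plain Python. Only the 128-token block size comes from the post. Scoring a block by the dot product between the query and the block's mean key is an assumption made for illustration, as are the function and parameter names. This is not MiniMax's MSA implementation or the vLLM kernel.

```python
import math

BLOCK_SIZE = 128  # from the post: KV blocks of 128 tokens


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, top_blocks, block_size=BLOCK_SIZE):
    """Attend over only the top-scoring KV blocks for one query vector.

    Block scoring (query . mean key of the block) is an illustrative
    assumption, not the scoring function used by the real model.
    """
    n = len(keys)
    starts = list(range(0, n, block_size))

    # 1) Score each block with a cheap summary of its keys.
    scores = []
    for s in starts:
        block = keys[s:s + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append(dot(query, mean_key))

    # 2) Keep the indices of the highest-scoring blocks.
    keep = sorted(range(len(starts)), key=lambda i: scores[i], reverse=True)[:top_blocks]
    keep.sort()
    token_ids = [
        t for i in keep for t in range(starts[i], min(starts[i] + block_size, n))
    ]

    # 3) Ordinary softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]
    return out, keep
```

With `top_blocks` fixed, the attention cost per query scales with the number of selected tokens rather than the full context length. The block-scoring pass still touches every block, but only through a single summary vector per block in this sketch.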

MiniMax (official) @MiniMax_AI

10 days ago

MiniMax M3, Open-Weight, Now On Hugging Face, with only ~428B parameters and ~23B activated parameters

Weights: https://t.co/g4Ybfa2kWH
MiniMax Sparse Attention: https://t.co/HcTlWRotG3

vLLM

@vllm_project

10 days ago

@Kimi_Moonshot Thanks to @verdacloud for providing the compute to verify K2.7 on @NVIDIAAI's GB300 and more!

vLLM

@vllm_project

10 days ago

🎉 Congrats to @Kimi_Moonshot on Kimi K2.7-Code, a coding-focused agentic model built on K2.6.

✨ 1T-parameter Mixture-of-Experts, 32B active per token
✨ MLA attention with a 256K-token context window
✨ ~30% fewer thinking tokens than K2.6 for more efficient reasoning

Supported in vLLM, reusing the same deployment as K2.6.

🔗 https://t.co/Pe1eguuBDX

Kimi.ai @Kimi_Moonshot

10 days ago

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!

🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.

⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.

🔗 Kimi Code: https://t.co/uvoSJKyGCY
🔗 API: https://t.co/EOZkbOwCN4

vLLM

@vllm_project

12 days ago

Congrats to @GoogleDeepMind on DiffusionGemma 🎉 A 26B diffusion language model on the Gemma4 backbone, and the first dLLM natively supported in vLLM. It denoises 256-token blocks in parallel instead of generating one token at a time: 1200+ output tok/s at batch size 1 on a single H200 (FP8). Built on model runner v2's ModelState plus the existing speculative decoding path, with minimal scheduler or runner changes. FP8 and NVFP4 checkpoints are on the @RedHat_AI hub. Thanks to the @GoogleDeepMind, @RedHat_AI, and @NVIDIAAI teams! 🔗 https://t.co/KrPmAoGpm2

Google Gemma

@googlegemma

12 days ago

Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

vLLM

@vllm_project

12 days ago

🎉 Excited to see Inferoa from @agenticin. It builds a community agent harness on the vLLM stack, with the agent loop shaped by inference economics: prefix-cache discipline, context optimization, and routing across self-hosted and frontier models. Looking forward to seeing how developers extend it. 🚀

Agentic Intelligence Lab @agenticin

12 days ago

Introducing Inferoa: Inference-native Tokenmaxxing Agent Harness built for Loop Engineering. Building around @vllm_project to run recursive long-horizon tasks with discipline, and context optimization via #codegraph #rtk etc. Try it at @ProductHunt! https://t.co/iFcvewJ5c6
