Huge milestone from the @anyscalecompute + @googlecloud GKE teams!
Ray Serve LLM provides up to 4.4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads than previous versions.
Three optimizations made this possible on the Ray Serve LLM + vLLM stack:
- Direct streaming with a control-plane-only endpoint picker
- A new vLLM Ray V2 executor backend
- HAProxy ingress for routing at the speed of C
Ray's primitives for fault tolerance, observability, and portability across K8s and VMs are a great foundation as inference deployments get more complex.
Congrats to the team! Try the new Ray V2 executor today in vLLM with --distributed-executor-backend ray.
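As a minimal sketch of the flag mentioned above, the snippet below assembles a `vllm serve` command line that selects the Ray executor backend. The model name is a hypothetical placeholder, and adding `--tensor-parallel-size` is an assumption about a typical multi-GPU launch, not something the announcement specifies.

```python
import shlex


def build_vllm_serve_command(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Return the argv for `vllm serve` with the Ray distributed executor backend."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return [
        "vllm", "serve", model,
        # Flag quoted in the post above.
        "--distributed-executor-backend", "ray",
        # Assumption: shard the model across this many GPUs.
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


if __name__ == "__main__":
    # "my-org/my-model" is a hypothetical model identifier.
    cmd = build_vllm_serve_command("my-org/my-model", tensor_parallel_size=2)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where vLLM and Ray are installed.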
Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM's production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in benchmarks across a variety of workloads and deployment patterns.
In Ray 2.56, we see up to 4x higher request throughput on prefill-heavy workloads, and 24x higher request throughput on decode-heavy workloads.
Congrats to @poolsideai on Laguna M.1, a new open-weights agentic coding model. Day-0 support landed in vLLM v0.21.0.
- 70-layer sparse MoE: 225B total params, 23B active per token, 256K context
- 256 experts with top-k=16 routing, built for long-horizon agentic coding
- Native interleaved reasoning between tool calls, toggleable per request, Apache 2.0
Recipe: https://t.co/lDG8poco5g
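To make "256 experts with top-k=16 routing" concrete, here is a rough sketch of top-k expert selection for a single token. Only the expert count and k come from the post; the softmax over the selected logits and the random gate logits are assumptions for illustration, and the model's actual router may score and normalize differently.

```python
import math
import random

NUM_EXPERTS = 256  # from the announcement
TOP_K = 16         # from the announcement


def route_token(gate_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Assumption: weights are a softmax over the selected logits only.
    """
    if not 0 < top_k <= len(gate_logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Indices of the top_k largest gate logits, highest first.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned gate
    chosen = route_token(logits)
    print(len(chosen), "of", NUM_EXPERTS, "experts active for this token")
```

With these settings each token touches 16 of 256 experts, which is why only a fraction of the total parameters is active per token.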
Today we're releasing the weights for Laguna M.1,
our most capable model to date, with a 256K context length.
Both base and post-trained checkpoints are now available on Hugging Face under Apache 2.0.
Your coding agent can run on open models you host yourself, not just a hosted API.
vLLM serves them fast and cost-efficiently on your own GPUs, with broad hardware support across @NVIDIA, @AMD, and more. It speaks the same OpenAI Responses API that Codex uses, so any compatible agent points right at your server and any tool-calling model is a drop-in replacement.
Spin up the latest GLM 5.2 (@Zai_org), Kimi K2.7 Code (@Kimi_Moonshot), or MiniMax M3 (@MiniMax_AI) model, or whatever open model fits your needs, and start coding.
Guide: https://t.co/EGNPBtlLB3
Serving Recipe: https://t.co/ftERFfutuf
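For a sense of what "points right at your server" can look like, the sketch below builds a Responses-style request against a self-hosted endpoint using only the standard library. The base URL, the `/v1/responses` path on the local server, and the model name are assumptions for illustration; check the linked guide for the exact setup.

```python
import json
import urllib.request


def build_responses_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI Responses API shape."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical local server address and model name.
    req = build_responses_request(
        "http://localhost:8000",
        "my-org/my-coding-model",
        "Write a function that reverses a string.",
    )
    print(req.get_method(), req.full_url)
    # Sending requires a running server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Because the request shape does not change, switching models should only mean changing the `model` field and whatever the server was launched with.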
Thanks for the kind words! Day 0 @MiniMax_AI M3 support came together thanks to this collaboration in the open.
Big kudos to @rogerw0108 and @mgoin_ for the ongoing push, review, and mentorship. More improvements landing soon.
https://t.co/TbJJ8IUlMw
Great work by the @vllm_project team and @NVIDIA on a smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of the ongoing M3 workstream:
NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inferencing (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!
A great deep dive from @SemiAnalysis_ on RL training systems and how much RL efficiency comes down to matching trainer and generator throughput!
Shoutout to @KaichaoYou and Ao Shen from @inferact for the sandbox scaling experiments with vLLM + verl, building on @KaichaoYou's early RL integration work across OpenRLHF, verl, and slime 🫡
RL Systems Mind the Gap:
Matching Trainer and Generator Throughput
RL Training Infrastructure, GRPO,
PipelineRL, Async RL, Policy Staleness,
RL Sandbox Infra, CPU Requirements,
TCO Analysis, Thinking Machines Tinker
https://t.co/yr5oH99h4B
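A back-of-the-envelope way to see why matching trainer and generator throughput matters: in a deliberately simplified steady-state model (an assumption here, not taken from the article), the pipeline moves at the pace of the slower side and the faster side idles. The numbers in the example are made up.

```python
def pipeline_utilization(trainer_tok_s: float, generator_tok_s: float) -> dict[str, float]:
    """Simplified steady-state model: the slower stage sets the pace for both."""
    if trainer_tok_s <= 0 or generator_tok_s <= 0:
        raise ValueError("throughputs must be positive")
    effective = min(trainer_tok_s, generator_tok_s)
    return {
        "effective_tok_s": effective,
        # Fraction of time each side is doing useful work.
        "trainer_utilization": effective / trainer_tok_s,
        "generator_utilization": effective / generator_tok_s,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a trainer that consumes faster than rollouts arrive.
    stats = pipeline_utilization(trainer_tok_s=10_000, generator_tok_s=4_000)
    print(stats)  # trainer_utilization is 0.4, so the trainer idles 60% of the time
```

Real systems complicate this with asynchrony and policy staleness, which is where the linked deep dive goes further.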
Day-0 support for GLM-5.2 in vLLM, available today in v0.23.0!
Congrats to @Zai_org on GLM-5.2, a flagship model built for long-horizon coding agents.
✨ 1M-token context, built to hold project-scale engineering work in a single run
✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization
✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms
✨ Client-side and mobile engineering, including an on-device debugging loop
Try running it on vLLM today:
https://t.co/tRduouqn6e
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
Great write-up from the @anyscalecompute team on PD disaggregation with Ray Serve + vLLM! PD Disagg is one of the most difficult techniques to get right in serving; the wins are real, but only in the right settings.
Great to see it pressure-tested on AMD MI325X with Ray Serve + vLLM!
One pattern we keep seeing with customers serving LLMs at scale:
Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced.
So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM, where the PD path has been much less paved. 🧵
Models, serving, and what to know before upgrading:
- New models: Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, Cohere Mini Code
- Rust frontend grows up: a streaming generate endpoint, dynamic LoRA endpoints, /version + /server_info, and new tool parsers (InternLM2, Phi-4-mini, Gemma4)
- Security: SSL/TLS for the data-parallel supervisor, and out-of-vocab token IDs rejected before they reach the GPU logprob path
Thanks to all 200 contributors this cycle (63 first-timers).
Full release notes: https://t.co/lfAyYC0OXm
vLLM v0.23.0 is out! 408 commits from 200 contributors (63 new).
Highlights: DeepSeek-V4 matures across backends (TRTLLM-gen attention kernel, sparse MLA decoupled from V3.2, EPLB for the Mega-MoE), Model Runner V2 now default for Llama + Mistral dense models, Gemma 4 Unified (encoder-free) + MTP, a maturing Rust frontend, multi-tier KV cache offloading with an object-store tier, and a unified reasoning + tool-call parser.
Thread:
Congrats to @vllm_project & @lmsysorg for releasing MiniMax M3 428B on both the CUDA & ROCm stack on day 0! MiniMax M3 includes:
- Block sparse attention, with 9x faster prefill than M2.7
- Day 0 open MXFP8 weights
- Furthermore, @Inferact released day-0 support for an open-weight EAGLE3 draft model
Excited to try out the performance on MiniMax M3!
Day-0 goes beyond inference: NeMo RL from @NVIDIAAI also supports MiniMax M3 on day 0, with vLLM powering rollout generation. 💡
A reference GRPO recipe is ready, so you can start post-training M3 for your own agentic workflows right away.
Branch: https://t.co/UPcMRXRkxP
Recipe: https://t.co/frO0BZTH1O
Congrats to @MiniMax_AI on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model.
At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve.
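The block-selection idea described above can be sketched in a few lines of plain Python for a single query. This is a toy illustration, not the MSA kernel: scoring each block by the dot product with its mean key is an assumption, as are the function name and the tiny sizes used in the demo.

```python
import math


def _dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size=128, top_blocks=2):
    """Attend over only the highest-scoring KV blocks for one query vector.

    Returns (output_vector, attended_token_indices).
    """
    n = len(keys)
    if n == 0 or len(values) != n:
        raise ValueError("keys and values must be non-empty and equal in length")
    dim = len(q)
    # 1) Score each block of keys (assumption: use the block's mean key).
    block_scores = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(k[j] for k in block) / len(block) for j in range(dim)]
        block_scores.append((_dot(q, mean_key), start))
    # 2) Keep the top-scoring blocks and collect their token positions.
    kept = sorted(block_scores, reverse=True)[:top_blocks]
    token_ids = sorted(i for _, s in kept for i in range(s, min(s + block_size, n)))
    # 3) Ordinary scaled softmax attention, restricted to the kept tokens.
    scale = 1.0 / math.sqrt(dim)
    logits = [_dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    norm = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, token_ids)) / norm
        for j in range(len(values[0]))
    ]
    return out, token_ids


if __name__ == "__main__":
    # Tiny demo with block_size=2 instead of 128 so it is easy to follow.
    q = [1.0, 0.0]
    keys = [[0, 1], [0, 1], [5, 0], [5, 0], [-1, 0], [-1, 0]]
    values = [[0.0], [0.0], [1.0], [3.0], [9.0], [9.0]]
    out, ids = block_sparse_attention(q, keys, values, block_size=2, top_blocks=1)
    print(ids, out)  # [2, 3] [2.0]
```

The cost of step 3 scales with the number of kept tokens rather than the full context, which is the intuition behind serving very long contexts this way.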
M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware:
✨ MSA sparse attention with dedicated prefill and decode kernels
✨ 1M-token context serving with prefix caching and chunked prefill
✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell
✨ Native multimodal input (image + video)
✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads
Day-0 support like this is a true team effort. Grateful to the teams at @MiniMax_AI, @NVIDIAAI, @AIatAMD, and @inferact, and to the vLLM community for making it happen.
Deep dive into the implementation, kernel work, and deployment recipes:
https://t.co/TbEc9VgqJ7
MiniMax M3, Open-Weight, Now On Hugging Face, with only ~428B parameters and ~23B activated parameters
Weights:
https://t.co/g4Ybfa2kWH
MiniMax Sparse Attention:
https://t.co/HcTlWRotG3
Congrats to @Kimi_Moonshot on Kimi K2.7-Code, a coding-focused agentic model built on K2.6.
✨ 1T-parameter Mixture-of-Experts, 32B active per token
✨ MLA attention with a 256K-token context window
✨ ~30% fewer thinking tokens than K2.6 for more efficient reasoning
Supported in vLLM, reusing the same deployment as K2.6.
https://t.co/Pe1eguuBDX
Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.
⚡️ 6x High-Speed Mode coming soon!
Available today via Kimi API and Kimi Code.
Kimi Code: https://t.co/uvoSJKyGCY
API: https://t.co/EOZkbOwCN4
Congrats to @GoogleDeepMind on DiffusionGemma! A 26B diffusion language model on the Gemma4 backbone, and the first dLLM natively supported in vLLM.
It denoises 256-token blocks in parallel instead of generating one token at a time: 1200+ output tok/s at batch size 1 on a single H200 (FP8).
Built on model runner v2's ModelState plus the existing speculative decoding path, with minimal scheduler or runner changes. FP8 and NVFP4 checkpoints are on the @RedHat_AI hub. Thanks to the @GoogleDeepMind, @RedHat_AI, and @NVIDIAAI teams!
https://t.co/KrPmAoGpm2
Meet DiffusionGemma!
An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.
Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here's what's new with DiffusionGemma:
Excited to see Inferoa from @agenticin.
It builds a community agent harness on the vLLM stack, with the agent loop shaped by inference economics: prefix-cache discipline, context optimization, and routing across self-hosted and frontier models.
Looking forward to seeing how developers extend it.
Introducing Inferoa: Inference-native Tokenmaxxing Agent Harness built for Loop Engineering.
Building around @vllm_project to run recursive long-horizon tasks with discipline, and context optimization via #codegraph, #rtk, etc.
Try it at @ProductHunt!
https://t.co/iFcvewJ5c6