EmbeddedLLM

@EmbeddedLLM

Your open-source AI ally. We are committed to making production-grade AI inference as accessible and reliable as electricity, powered by vLLM.

Joined October 2023

1.4K Following

1.1K Followers

522 Posts

EmbeddedLLM retweeted

Kaichao You

@KaichaoYou

5 days ago

vLLM ❤️ AMD

EmbeddedLLM retweeted

Simon Mo

@simon_mo_

6 days ago

Amazing work! More and more RL frameworks are using vLLM as default. @vllm_project along with @anyscalecompute and @NovaSkyAI revamped weight syncing and improved wide-ep deployment for rollout!

EmbeddedLLM retweeted

vLLM

@vllm_project

6 days ago

We've shipped two major upgrades for RL✨!

1. Native weight syncing APIs: Standardizes weight transfer, provides optimized implementations for NCCL and CUDA IPC out of the box, and also lets frameworks easily bring their own.
2. Improved pause/resume for Async RL: Careful coordination between DP ranks so that engines don’t deadlock. Validated at scale in P/D, wide-EP setups!

In collaboration with @anyscalecompute, @NovaSkyAI, and @RedHat.

More and more RL frameworks are using vLLM as the default for inference, details in the blog 👇

https://t.co/LLmL8zJLtR

212

11K

EmbeddedLLM retweeted

Simon Mo

@simon_mo_

9 days ago

🦀 rustifying vLLM, one part at a time, great work @BugenZhao!

EmbeddedLLM retweeted

vLLM

@vllm_project

9 days ago

🦀 The Rust frontend is officially merged into vLLM!

As GPUs get faster, the frontend has become a real share of CPU time. The new Rust frontend is a drop-in alternative to the Python API server — same engine, same ZMQ boundary. Opt in with VLLM_USE_RUST_FRONTEND=1.

Early numbers: on a preprocess-heavy workload, ~837 req/s vs ~162 req/s for default Python — ~5x in a single process.

A few design choices we're excited about:
• Layered crates with clear boundaries
• Stream-native pipeline — non-streaming for free
• Builds on stable Rust

Huge thanks to @BugenZhao from @inferact for introducing the work at @PyTorch Meetup Singapore. https://t.co/Tw8PoIjbH9

917

104

250

84K
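As a quick sanity check on the Rust frontend post above, the following sketch recomputes the quoted speedup from the two request rates and shows the opt-in variable being added to a process environment. The rates and the variable name come from the post; the helper name `speedup` and the idea of passing the environment to a separately launched server process are illustrative assumptions, not part of vLLM.

```python
import os


def speedup(new_rate: float, baseline_rate: float) -> float:
    """Return how many times faster new_rate is than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return new_rate / baseline_rate


# Approximate request rates quoted in the post (requests per second).
rust_req_per_s = 837.0
python_req_per_s = 162.0

ratio = speedup(rust_req_per_s, python_req_per_s)
print(f"Rust frontend vs Python frontend: {ratio:.2f}x")

# Opt-in flag named in the post. Assumption: this environment would be
# handed to whatever command starts the server; that command is not
# shown in the post and is not guessed here.
env = {**os.environ, "VLLM_USE_RUST_FRONTEND": "1"}
```

The ratio comes out slightly above 5, which is consistent with the "~5x" in the post.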

EmbeddedLLM retweeted

Inferact

@inferact

14 days ago

Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 We brought together top researchers and AI systems engineers for an afternoon of rich conversations on @vllm_project, the frontier of inference, and where AI systems are headed next. Huge thanks to everyone who joined — the energy in the room was something else. This is exactly the kind of cross-pollination between labs, infra teams, and industry that pushes the whole stack forward. More to come. 👀 #MLSys2026 #vLLM

EmbeddedLLM retweeted

vLLM

@vllm_project

14 days ago

A vLLM MoE deployment's DP/EP topology used to be locked in at launch — scaling or swapping config meant a full restart, in-flight traffic dropped.

Elastic Expert Parallelism changes that. One API call resizes a live deployment:

curl -X POST localhost:8000/scale_elastic_ep \
  -d '{"new_data_parallel_size": 16}'

Under the hood: standby comm groups span the target topology, EPLB redistributes experts across the new EP group, and weights are transferred directly between GPUs over NVIDIA NVLink/RDMA.

The same runtime reconfiguration path is what fault-tolerant serving needs: evict failed ranks, redistribute their experts, bring replacements back, no restart.

Thanks to @NVIDIAAI, Sky Computing, @anyscalecompute, @RedHat_AI, and the community.

📖 https://t.co/bHmyFNZPEg

207

25K
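The curl call in the Elastic Expert Parallelism post above could be issued from Python roughly as follows. This is a minimal sketch using only the standard library: the endpoint path and the `new_data_parallel_size` field are taken from the post, while the helper name `build_scale_request`, the `http://` scheme, and the JSON `Content-Type` header are assumptions added for illustration. The request is only built here, not sent, since sending needs a running server.

```python
import json
import urllib.request


def build_scale_request(
    new_dp_size: int, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the resize request shown in the post."""
    if new_dp_size < 1:
        raise ValueError("data parallel size must be a positive integer")
    body = json.dumps({"new_data_parallel_size": new_dp_size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scale_elastic_ep",
        data=body,
        # Assumption: the server accepts a JSON content type. The curl
        # example in the post does not set this header explicitly.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scale_request(16)

# Sending is left commented out because it requires a live deployment:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode())
```

Keeping construction separate from sending makes it easy to inspect or log the exact payload before touching a live deployment.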

EmbeddedLLM retweeted

vLLM

@vllm_project

15 days ago

🎉 Day-0 vLLM support for Command A+! Congrats to @cohere on their most powerful open-source model yet.

🧠 218B MoE / 25B active, Apache 2.0
🌍 Multimodal + 48 languages
⚡ Runs on as little as 2× H100s @ W4A4

Serve it now in vLLM! 🚀

📖 https://t.co/NKFHejTNtl

200

31K

EmbeddedLLM retweeted

vLLM

@vllm_project

15 days ago

🎉 Congrats to the VeRL-Omni team on the pre-release of a general RL post-training framework for multimodal generative models. Built on verl + vllm-omni.

vLLM-Omni handles the multimodal rollout with step-wise continuous batching and embedding caching; vLLM serves the VLM-as-judge / OCR reward model, overlapped with rollout and training. In the Qwen-Image OCR demo, moving the reward to its own GPU cuts per-step wall-clock by ~14%.

Released: Qwen-Image with FlowGRPO / MixGRPO / GRPO-Guard. BAGEL and Qwen3-Omni-Thinker PR-ready.

Excited to push multimodal generative RL forward together with VeRL-Omni and the broader community. 🙌

📖 https://t.co/D2TpU8EU9J
🔗 https://t.co/uwmO3EWwcS

158

22K

EmbeddedLLM

@EmbeddedLLM

20 days ago

@inferact Huge congrats on the second office🚀 Go vLLM!

237

EmbeddedLLM

@EmbeddedLLM

22 days ago

$AMD is unstoppable now.

@AIatAMD ROCm flywheel is spinning hard: persistent MI355X access for @vllm_project.

@EmbeddedLLM is proud to help power that loop.

AI inference is infrastructure. And the next era of AI infra won’t be won by default distribution. It’ll be won by kernels, compilers, runtimes, and relentless execution.

We’re here for that fight. 🚀

SemiAnalysis

@SemiAnalysis_

22 days ago

The shock came on the Day 0 DeepSeek V4 launch: since the community vLLM/SGLang maintainers only had access to NVIDIA GPUs, they were only able to add Day 0 NVIDIA GPU support. Since then, AMD has finally prioritized this with actions and not just words, contributing a 2.5 million dollar 9-node MI355X cluster with 8x400G Pollara NICs to the open source @vllm_project maintainers (including @EmbeddedLLM maintainers, @simon_mo_ and others) and a 1.1 million dollar 4+ node MI355 dev cluster. 2/4

217

EmbeddedLLM retweeted

vLLM

@vllm_project

24 days ago

Great read from the @RedHat_AI team — a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2 release. If you're considering TurboQuant for your workload, this is the data to start from. 📝 https://t.co/YkQIsyuDUN

281

205

34K

EmbeddedLLM retweeted

vLLM

@vllm_project

27 days ago

🚀 vLLM-Omni v0.20.0 is out — aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).

⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP — +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU

Check it out → https://t.co/uCTQxa2dQk

234

17K
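The headline figures in the vLLM-Omni release post above can be reproduced from the raw numbers it quotes. The sketch below is plain arithmetic on those numbers; the helper name `percent_change` is illustrative, and RTF is read here as real-time factor, where a lower value means faster synthesis.

```python
def percent_change(before: float, after: float) -> float:
    """Return the relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Qwen3-Omni throughput on H20 at 32 concurrency, in requests per second.
qwen3_omni_gain = percent_change(0.241, 0.414)

# VoxCPM2 real-time factor before and after; the ratio says how many
# times less compute time is spent per second of generated audio.
voxcpm2_rtf_ratio = 0.946 / 0.106

print(f"Qwen3-Omni throughput change: {qwen3_omni_gain:+.1f}%")
print(f"VoxCPM2 RTF improvement: {voxcpm2_rtf_ratio:.2f}x")
```

The throughput change rounds to the +72% stated in the post, and the RTF figures correspond to roughly a ninefold reduction.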

EmbeddedLLM retweeted

Kaichao You

@KaichaoYou

27 days ago

This is growth-hacking dressed up in open-source language, @radixark please stop doing it immediately.

Paying people in platform credits to star a GitHub repo and repost a marketing tweet isn't "fueling the community" — it's laundering paid promotion through the trust signals open source depends on. Stars are supposed to mean someone found a project useful. Attach a $200 bounty and the number means nothing. GitHub's own policies prohibit this for exactly that reason.

285

45K

EmbeddedLLM retweeted

Zyphra

@ZyphraAI

29 days ago

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density.

With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵 https://t.co/URTj1br9tw

101

291

EmbeddedLLM retweeted

vLLM

@vllm_project

29 days ago

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake.

Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets:

🚀 3.8x higher throughput
⚡ 46x lower P50 TTFT
⏱️ 8.6x lower E2E latency
📈 Cache hit rate 1.7% -> 92.2%
🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate

🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI

📖 Read more: https://t.co/XIRtQ9pYVQ 🧵👇

231

153

58K

EmbeddedLLM retweeted

vLLM

@vllm_project

30 days ago

🚀 Day-0 MTP support for Gemma4 now available at vLLM with ready-to-use docker image!

⚡️Enjoy up to 3x faster decoding performance to supercharge your development with zero quality degradation!

Check out the full vLLM recipes for Gemma 4 model series👇
https://t.co/IrCaaa6SIo https://t.co/eFcAZRogLF

898

551

89K

EmbeddedLLM retweeted

Yixin Dong

@yi_xin_dong

about 1 month ago

Introducing XGrammar-2: structured generation for complex agent harnesses.

Strict tool-calling formats. Built-in DeepSeek-V4 and Qwen-3.6 support. Up to 80x speedup over XGrammar. Ready-to-use integrations with vLLM, SGLang, TensorRT-LLM, and more! ⚡

From Claude Code to OpenClaw, agents are defining more complex harnesses. XGrammar-2 ensures LLMs always interact with them in the right way.

Built in collaboration with DeepSeek, Databricks, and leading frontier AI labs to bring XGrammar-2 into latest models and products.

🧩 Structural Tag: one unified abstraction to describe any format your agent needs
🚀 Scales to 500+ strictly typed tools for complex agent harnesses
🌐 Native APIs in Python, C++, Rust, and JS, running everywhere from cloud to edge
🛠️ Integrated with vLLM, SGLang, TensorRT-LLM, and more

Excited to see what agent builders create with it!

Blog: https://t.co/N0Tbl588BH
GitHub: https://t.co/lo4yScuI2f

149

42K

EmbeddedLLM retweeted

Kaichao You

@KaichaoYou

about 1 month ago

@inferact builds on top of vLLM, improves vLLM, and ships optimizations back to vLLM🤩 These optimizations are built on the foundation of the vLLM community's contributions—we added some contributions on top, and more to come💪🏻

EmbeddedLLM retweeted

Simon Mo

@simon_mo_

about 1 month ago

Don't miss this. @woosuk_k added MegaMoE support to vLLM's DeepSeek V4 code in one day 🫡
