SGLang @SGL_Project - Twitter Profile

7 months ago

SGLang now runs natively on TPU with a new pure Jax backend! SGLang-Jax leverages SGLang's high-performance server architecture and uses Jax to compile the model's forward pass. By combining SGLang and Jax, it delivers fast, native TPU inference while maintaining support for advanced features like continuous batching, prefix caching, parallelism, speculative decoding, and highly optimized TPU kernels. Learn more in the blog below👇

lmsysorg's tweet photo. SGLang now runs natively on TPU with a new pure Jax backend!

SGLang-Jax leverages SGLang's high-performance server architecture and uses Jax to compile the model's forward pass. By combining SGLang and Jax, it delivers fast, native TPU inference while maintaining support for advanced features like continuous batching, prefix caching, parallelism, speculative decoding, and highly optimized TPU kernels.

Learn more in the blog below👇

5

104

21

34

62K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

🚀 Excited to see the open-source release of MiniMax-M2! Designed for advanced coding and autonomous workflows, MiniMax-M2 now runs seamlessly on SGLang with Day-0 support, delivering high-speed inference and smooth handling of long contexts. SGLang continues to drive the evolution of AI applications ready for real-world deployment. HuggingFace: https://t.co/iMT4PfIumX Github: https://t.co/1Z32ukiipy #MiniMaxM2 #SGLang

lmsysorg's tweet photo. 🚀 Excited to see the open-source release of MiniMax-M2! Designed for advanced coding and autonomous workflows, MiniMax-M2 now runs seamlessly on SGLang with Day-0 support, delivering high-speed inference and smooth handling of long contexts.
SGLang continues to drive the evolution of AI applications ready for real-world deployment.

HuggingFace: https://t.co/iMT4PfIumX
Github: https://t.co/1Z32ukiipy
#MiniMaxM2 #SGLang

2

93

14

21

127K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

⚡ Zero-overhead scheduler for speculative decoding ⚡ When your GPUs are running LLM inference, unoptimized software will waste a huge amount of time on CPU overhead - such as kernel launch and metadata bookkeeping. SGLang has been pioneering the zero-overhead CPU runtime for LLM runtime since last year. Now, we also carefully tune the scheduler for speculative decoding and seeing 10% - 20% speedup across the board. This improvement has been tested by the @googlecloud vertex AI team and we welcome more people to join our development. See the roadmap below ⬇️

lmsysorg's tweet photo. ⚡ Zero-overhead scheduler for speculative decoding ⚡

When your GPUs are running LLM inference, unoptimized software will waste a huge amount of time on CPU overhead - such as kernel launch and metadata bookkeeping.

SGLang has been pioneering the zero-overhead CPU runtime for LLM runtime since last year. Now, we also carefully tune the scheduler for speculative decoding and seeing 10% - 20% speedup across the board.

This improvement has been tested by the @googlecloud vertex AI team and we welcome more people to join our development. See the roadmap below ⬇️

1

125

20

53

53K

sgl_project retweeted

Lianmin Zheng

@lm_zheng

7 months ago

This feature has finally been merged into the main branch. The team has been battling the PyTorch memory allocator and CUDA stream management for ages to iron out all the dependencies and race conditions.

5

217

18

72

24K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

Try deepseek-ocr using SGLang! python -m sglang.launch_server --model deepseek-ai/DeepSeek-OCR (optional) add --keep-mm-feature-on-device flag for better TTFT.

lmsysorg's tweet photo. Try deepseek-ocr using SGLang!

python -m sglang.launch_server --model deepseek-ai/DeepSeek-OCR

(optional) add --keep-mm-feature-on-device flag for better TTFT. https://t.co/Y4uFNmZB4S

0

24

5

11

4K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

🚀 SGLang Model Gateway v0.2 Drops 🚪 SGL-Router pioneered cache-aware routing last year. Now, it is fully rebuilt and renamed as “SGLang Model Gateway” - with extreme performance and much more features. Core upgrades: - Multi-Model Inference Gateway (IGW) Mode: Run multi-model fleets under one gateway—custom policies, health checks, load balancing, and flexible prefill-decode disaggregation. - Rust gRPC Powered: Bypass slow Python and HTTP runtime, extreme fast streaming, OpenAI-compatible APIs, cached tokenization! 🔥 - Pluggable Storage & MCP: Flexible history (memory/oracle) + seamless tool integration + response API. - Reliability Boost: Retries, metrics, tracing—all in. Your unified control plane for reasoning agents & enterprise LLMs. Backward compatible—easy migration! This is a huge contribution from the @Oracle team, led by Simo @hello_slin, Chang @ccskookie, Keyang @key4ng.

lmsysorg's tweet photo. 🚀 SGLang Model Gateway v0.2 Drops 🚪

SGL-Router pioneered cache-aware routing last year. Now, it is fully rebuilt and renamed as “SGLang Model Gateway” - with extreme performance and much more features.

Core upgrades:
- Multi-Model Inference Gateway (IGW) Mode: Run multi-model fleets under one gateway—custom policies, health checks, load balancing, and flexible prefill-decode disaggregation.
- Rust gRPC Powered: Bypass slow Python and HTTP runtime, extreme fast streaming, OpenAI-compatible APIs, cached tokenization! 🔥
- Pluggable Storage & MCP: Flexible history (memory/oracle) + seamless tool integration + response API.
- Reliability Boost: Retries, metrics, tracing—all in.

Your unified control plane for reasoning agents & enterprise LLMs. Backward compatible—easy migration!

This is a huge contribution from the @Oracle team, led by Simo @hello_slin, Chang @ccskookie, Keyang @key4ng.

6

101

18

44

36K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

Join PyTorch conference today to learn more about the latest progress from SGLang. - Optimize long-tail and MoE challenges in RL - General large scale inference optimization and deployment

0

15

3

1

3K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

We're excited to announce the collaboration between KTransformers and SGLang! KTransformers has been a killer for local AI inference with its system-algorithm co-design, often showing 5x - 10x speedup. This integration equips SGLang with KTransformers’ inference strategy and optimized kernels, specifically optimized for MoE models. Combined with SGLang’s native multi-GPU scaling, the solution can be seamlessly extended to serve much larger workloads. ⬇️ Learn more in our tech blog below

lmsysorg's tweet photo. We're excited to announce the collaboration between KTransformers and SGLang!

KTransformers has been a killer for local AI inference with its system-algorithm co-design, often showing 5x - 10x speedup.

This integration equips SGLang with KTransformers’ inference strategy and optimized kernels, specifically optimized for MoE models. Combined with SGLang’s native multi-GPU scaling, the solution can be seamlessly extended to serve much larger workloads.

⬇️ Learn more in our tech blog below

1

85

15

43

30K

sgl_project retweeted

LMSYS Org

@lmsysorg

7 months ago

Exciting updates on DGX Spark: Now you can run gpt-oss-20b at 70 tokens/s with SGLang! This is 1.4x faster than what we got in our blog last week. We worked with the @NVIDIAAIDev team to fix a bunch of Triton and quantization issues. Cannot wait to see how much performance we can get from this tiny computer. Usage: download the lmsysorg/sglang:spark docker image and launch with python3 -m sglang.launch_server --model openai/gpt-oss-20b

lmsysorg's tweet photo. Exciting updates on DGX Spark: Now you can run gpt-oss-20b at 70 tokens/s with SGLang! This is 1.4x faster than what we got in our blog last week.

We worked with the @NVIDIAAIDev team to fix a bunch of Triton and quantization issues. Cannot wait to see how much performance we can get from this tiny computer.

Usage: download the lmsysorg/sglang:spark docker image and launch with python3 -m sglang.launch_server --model openai/gpt-oss-20b

11

147

19

47

36K

sgl_project retweeted

NVIDIA AI Developer

@NVIDIAAIDev

7 months ago

🙌 We love seeing these performance gains of gpt-oss-20b at 70 tokens/s with SGLang (@lmsysorg) on NVIDIA DGX Spark. 👇

1

93

13

7

10K

sgl_project retweeted

Lianmin Zheng

@lm_zheng

7 months ago

1.4x speedup after one week of release!

4

145

7

17

15K

sgl_project retweeted

LMSYS Org

@lmsysorg

8 months ago

🚀 SGLang In-Depth Review of the NVIDIA DGX Spark is LIVE! Thanks to @NVIDIA’s early access program, SGLang makes its first ever appearance in a consumer product, the brand-new DGX Spark. The DGX Spark’s 128GB Unified Memory and Blackwell architecture set a new standard for local AI prototyping and edge computing. We're thrilled to bring these cutting-edge performance insights and software support to the developer community. Our review dives into how to efficiently deploy and accelerate large models like Llama 3.1 70B, GPT-OSS using SGLang's EAGLE3 speculative decoding and @Ollama on this beautiful piece of engineering. 👇 Unboxing video and tech blog in the thread #SGLang #NVIDIA #SparkSomethingBig #Blackwell #DGXSpark #AIInference #LLMServing

18

335

60

149

411K

sgl_project retweeted

Lianmin Zheng

@lm_zheng

7 months ago

MoE exposed interesting opportunities to fully utilize the heterogeneous hardware resources (CPU + GPU). KTransformer team is upstreaming their cool optimizations into the sglang stack to combine the best of both.

4

243

26

106

25K

sgl_project retweeted

Ying Sheng

@ying11231

9 months ago

What an incredible journey

2

57

2

3

9K

sgl_project retweeted

LMSYS Org

@lmsysorg

9 months ago

⚡️ Big update from Kimi K2! 256k context, Stronger coding & tool-calling, Smoother agent integration. Already tested with SGLang runtime — stable 60-100+ TPS with turbo API! 👉 Check it out: https://t.co/uepBg8HbMS

3

115

15

14

19K

sgl_project retweeted

LMSYS Org

@lmsysorg

11 months ago

🚀Summer Fest Day 3: Cost-Effective MoE Inference on CPU from Intel PyTorch team Deploying 671B DeepSeek R1 with zero GPUs? SGLang now supports high-performance CPU-only inference on Intel Xeon 6—enabling billion-scale MoE models like DeepSeek to run on commodity CPU servers. Key highlights: 1. Full CPU backend for SGLang with Intel AMX 2. Native BF16 / INT8 / FP8 support for both Dense and Sparse FFNs 3. 6–14× TTFT and 2–4× TPOT speedup vs. llama.cpp 4. 85%+ memory bandwidth efficiency with optimized MoE kernels 5. Flash Attention V2 + MLA + MoE all optimized for CPU 6. Multi-NUMA parallelism mapped from GPU-style Tensor Parallelism This work is now fully upstreamed to SGLang main—read how we achieved it, and how far you can go without a GPU 👇 #LLMInfra #ModelServing #MoE #Xeon6 #SGLang #FP8 #INT8 #CPUInference

lmsysorg's tweet photo. 🚀Summer Fest Day 3: Cost-Effective MoE Inference on CPU from Intel PyTorch team

Deploying 671B DeepSeek R1 with zero GPUs? SGLang now supports high-performance CPU-only inference on Intel Xeon 6—enabling billion-scale MoE models like DeepSeek to run on commodity CPU servers.

Key highlights:
1. Full CPU backend for SGLang with Intel AMX
2. Native BF16 / INT8 / FP8 support for both Dense and Sparse FFNs
3. 6–14× TTFT and 2–4× TPOT speedup vs. llama.cpp
4. 85%+ memory bandwidth efficiency with optimized MoE kernels
5. Flash Attention V2 + MLA + MoE all optimized for CPU
6. Multi-NUMA parallelism mapped from GPU-style Tensor Parallelism

This work is now fully upstreamed to SGLang main—read how we achieved it, and how far you can go without a GPU 👇

#LLMInfra #ModelServing #MoE #Xeon6 #SGLang #FP8 #INT8 #CPUInference

6

38

15

14

19K

sgl_project retweeted

LMSYS Org

@lmsysorg

11 months ago

🚨SGLang Summer Fest Bonus Drop🚨 Proud to share a joint effort from Mooncake by @Kimi_Moonshot, @Oracle , and SGLang: Kimi K2 trillion-scale deployment—running on 128 H200 GPUs sponsored by @NVIDIAAIDev DGX Cloud. OME + SGLang = MoE inference at production scale.👇

lmsysorg's tweet photo. 🚨SGLang Summer Fest Bonus Drop🚨
Proud to share a joint effort from Mooncake by
@Kimi_Moonshot, @Oracle , and SGLang: Kimi K2 trillion-scale deployment—running on 128 H200 GPUs sponsored by @NVIDIAAIDev DGX Cloud. OME + SGLang = MoE inference at production scale.👇 https://t.co/N5dKtPa27Z

5

111

23

25K

sgl_project retweeted

NVIDIA AI Infrastructure

@NVIDIAAIInfra

11 months ago

Proud to support this lightning-fast launch⚡️ ️ Accelerated through #NVIDIADGX Cloud and in partnership with Moonshot AI, @SGLang, and @Oracle Open Model Engine, we helped bring Kimi K2 to customers just days after its debut. Now, organizations can “Think Smart” and scale MoE inference with frontier performance. Explore how SGLang unlocked production-scale deployment ⤵️

2

76

14

3

10K

sgl_project retweeted

LMSYS Org

@lmsysorg

10 months ago

🚀 Introducing SpecForge – our open-source framework for speculative decoding training, built for SGLang and Eagle3. Train draft models that just work—scalable, efficient, and inference-ready. Supports LLaMA 4, DeepSeek, MoE, FSDP, TP & more. Up to 2.18× speedup. Huge thanks to our infra partner @VoltagePark, whose mission is to be a catalyst for innovation by democratizing access to high-performance AI infrastructure. Their support enabled us to train and evaluate large-scale speculative decoding models efficiently and reliably. We also would like to express our heartfelt gratitude to the Eagle3 team @hongyangzh, and LinkedIn Infra team @LinkedIn. Let’s build the future of fast LLMs together! #opensource #LLM #AI #SpeculativeDecoding

lmsysorg's tweet photo. 🚀 Introducing SpecForge – our open-source framework for speculative decoding training, built for SGLang and Eagle3. Train draft models that just work—scalable, efficient, and inference-ready. Supports LLaMA 4, DeepSeek, MoE, FSDP, TP & more. Up to 2.18× speedup.

Huge thanks to our infra partner @VoltagePark, whose mission is to be a catalyst for innovation by democratizing access to high-performance AI infrastructure. Their support enabled us to train and evaluate large-scale speculative decoding models efficiently and reliably.

We also would like to express our heartfelt gratitude to the Eagle3 team @hongyangzh, and LinkedIn Infra team @LinkedIn. Let’s build the future of fast LLMs together! #opensource #LLM #AI #SpeculativeDecoding

4

29

9

10

9K

sgl_project retweeted

Drew Houston

@drewhouston

11 months ago

@zhyncs42 love sglang!

3

16

2

0

2K

SGLang

@sgl_project

Last Seen Users on Sotwe

Trends for you

Most Popular Users