🚀 slime v0.3.0 is out!
This release is a major step toward agent-first RL.
We turned slime’s existing multi-turn / agentic capabilities into a more coherent foundation:
- slime/agent with reusable sandbox-agent components
- OpenAI / Anthropic-compatible adapters
- black-box coding-agent RL example
- variable global batch-size training
- fully async training as a first-class path
- lower host-memory usage for more flexible rollout-inference setups
- PPO refactor with actor-critic colocation
- delta weight sync, FlashQLA for Qwen GDN, --save-hf, and more CI coverage
slime is moving closer to a practical open-source framework for large-scale agentic RL.
Release note:
https://t.co/e1ONv8Q4aW
Huge congrats to the Microsoft AI team on MAI-Thinking-1.
Great to see large-scale RL systems converging around the SGLang + Ray ecosystem. Rocket’s design—async RL, separated rollout / inference / learner pools, router-based traffic control, prefix caching, and fault-tolerant inference—is very aligned with what we believe in slime: RL is not just an algorithm problem, but a full-stack infrastructure problem.
Excited to see more open RL infra ideas validated at frontier scale!
Huge milestone for the Microsoft AI team: seven frontier MAI models, led by MAI-Thinking-1. Proud that SGLang powered the RL inference stack behind it. Their Rocket framework runs SGLang and the SGLang router for load balancing, traffic control, prefix caching, and graceful failure recovery across thousands of inference chips.
Congrats to the team @MicrosoftAI 👏
Read more on how SGLang powers the stack: https://t.co/60fxfv6DWb
slime v0.3.0: Built for the Agent Era
🌟 Insights from Zhihu contributor @朱小霖
@zhuzilinallen
There's little doubt that OpenClaw and Opus have kicked open the door to the Agent era.
slime's server-based engine + custom rollout architecture was built with this direction in mind. But as Agents become real-world workloads, it's clear that an RL framework needs more than basic Agent support—it needs better inference orchestration, long-horizon training, environment integration, and maintainable engineering practices.
That's exactly what slime v0.3.0 is about:
🔗 https://t.co/mdhH0jwXav
🚀 Agent-Native Infrastructure
The biggest risk for an RL framework isn't lacking features—it's chasing a new trend by piling temporary fixes onto an existing design. Agent training makes this especially tempting.
Instead of treating Agent support as one giant feature, we break it down into a series of infrastructure problems and solve them one by one.
🎱 Like in snooker, you don't clear the table with a single shot—you gradually create a better position. Many of the updates in slime v0.3.0 follow exactly this philosophy.
At its core, an RL framework is still about two things: inference and training. Let's look at them separately.
Faster & More Flexible Inference
Agent workloads dramatically increase token consumption and put much higher demands on serving systems. Two requirements stand out:
• Fast rollouts for long-horizon, multi-turn, tool-heavy tasks
• Production-like inference configurations so models can transition naturally from pretraining/SFT into deployment
To support this, slime expanded SGLang deployment with YAML-based multi-server configurations, allowing users to build composable server/router topologies instead of relying on a single inference setup.
📖 Docs:
https://t.co/9pxklYyC28
Many users now use slime as a launcher for complex SGLang clusters, which suggests people need more than an RL framework—they need a reliable infrastructure entry point.
We also improved --debug-rollout-only, making rollout-only and serving-only deployments much closer to production environments by cleanly separating inference and training resources.
Another trend we've observed: multi-turn interactions and tool usage significantly increase prefill pressure. Cache hit rates and memory capacity now directly impact rollout throughput.
Inspired by optimizations from the Miles team:
🔗 https://t.co/1Il0n70181
slime no longer offloads fp32 gradients and bf16 parameters in integrated training-serving workloads, saving roughly 6× parameter memory and improving rollout speed for Agent tasks.
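One way to read the 6× figure is as bytes per parameter: 4 for each fp32 gradient element plus 2 for each bf16 weight. That reading is an assumption, and the 30B model size below is purely illustrative and does not come from the release notes. A back-of-the-envelope sketch:

```python
# Rough host-memory estimate. Assumption: "6x" means 6 bytes per parameter.
FP32_BYTES = 4  # one fp32 gradient element
BF16_BYTES = 2  # one bf16 parameter element


def offload_bytes(num_params: int) -> int:
    """Host bytes that offloading fp32 grads + bf16 params would occupy."""
    return num_params * (FP32_BYTES + BF16_BYTES)


num_params = 30_000_000_000  # hypothetical 30B-parameter model
saved_gib = offload_bytes(num_params) / 2**30
print(f"~{saved_gib:.0f} GiB of host memory freed across the whole job")
```

In a real deployment the total is split across ranks, so the per-node saving depends on the parallelism layout.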
🧠 Training for Long-Horizon Agents
On the training side, the focus shifts from infrastructure to algorithm design.
slime v0.3.0 adds support for context-compaction and subagent workflows, where one prompt can generate multiple training samples.
Previously, frameworks often had to either:
• discard samples, wasting rollout data; or
• pad batches, increasing compute and memory costs.
Now, batch sizes can adapt dynamically to rollout results, eliminating both compromises while preserving proper normalization across related samples.
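A minimal sketch of the idea, assuming "normalization across related samples" means a GRPO-style per-prompt mean baseline. The `Sample` type and `group_advantages` helper are hypothetical and not slime's API:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    prompt_id: int  # which prompt this training sample came from
    reward: float


def group_advantages(samples):
    """Subtract the per-prompt mean reward so related samples share a baseline."""
    by_prompt = defaultdict(list)
    for s in samples:
        by_prompt[s.prompt_id].append(s.reward)
    return [
        s.reward - sum(by_prompt[s.prompt_id]) / len(by_prompt[s.prompt_id])
        for s in samples
    ]


# Prompt 0 fanned out into three samples (e.g. subagents); prompt 1 produced one.
rollout = [Sample(0, 1.0), Sample(0, 0.0), Sample(0, 0.5), Sample(1, 1.0)]
advantages = group_advantages(rollout)

# The batch is whatever the rollout produced: nothing is dropped or padded.
global_batch_size = len(rollout)
```

Here the step trains on four samples. A later step with a different fan-out would simply use a different batch size.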
Long-horizon tasks are also driving renewed interest in reward shaping, value functions, and PPO-style algorithms.
To support this, slime rebuilt its PPO implementation so that actor and critic always share GPU resources, allowing users to move from GRPO to PPO without allocating an entirely separate GPU cluster.
It also supports independent Megatron configurations for actor and critic.
📖 Docs:
https://t.co/kVE9I5ciOX
Meanwhile, as Agent rollouts grow longer, more teams are adopting asynchronous training. In v0.3.0, fully async training has been promoted from an experimental example to a first-class feature, sharing the same interface as partial-rollout async workflows.
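As a toy illustration of the pattern (not slime's interface), fully async training can be pictured as a producer/consumer loop: rollouts are tagged with the policy version that generated them, and the trainer consumes them as they arrive. All names here are hypothetical, and `asyncio.sleep(0)` stands in for a long agent rollout:

```python
import asyncio


async def rollout_worker(queue, num_rollouts, get_version):
    """Keep generating trajectories, tagging each with the policy version used."""
    for i in range(num_rollouts):
        await asyncio.sleep(0)  # stand-in for a long agent rollout
        await queue.put({"id": i, "policy_version": get_version()})
    await queue.put(None)  # sentinel: no more rollouts


async def trainer(queue, batch_size, state):
    """Consume rollouts as they arrive while the worker keeps generating."""
    batch, staleness = [], []
    while (item := await queue.get()) is not None:
        batch.append(item)
        if len(batch) == batch_size:
            # How many updates behind the current policy each sample is.
            staleness.extend(state["version"] - s["policy_version"] for s in batch)
            state["version"] += 1  # stand-in for an optimizer step + weight sync
            batch.clear()
    return staleness


async def main():
    state = {"version": 0}
    queue = asyncio.Queue(maxsize=4)  # bounds how far rollouts can run ahead
    producer = asyncio.create_task(rollout_worker(queue, 8, lambda: state["version"]))
    staleness = await trainer(queue, batch_size=2, state=state)
    await producer
    return state["version"], staleness


final_version, staleness = asyncio.run(main())
```

The bounded queue is one simple way to cap off-policy staleness. Real systems usually add explicit staleness limits or importance-sampling corrections.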
🤖 slime/agent: Solidifying Common Agent Components
While slime still encourages users to build their own custom harnesses, we've found that some Agent components are common enough to standardize.
That's why v0.3.0 introduces slime/agent/, including utilities like:
• trajectory merging
• OpenAI/Anthropic request interception
• reusable Agent tooling
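To make "trajectory merging" concrete, here is a hedged sketch of one way intercepted requests could be stitched into a single training sequence by prefix matching. `merge_requests` is a hypothetical helper, not the function shipped in slime/agent/:

```python
def merge_requests(requests):
    """Merge intercepted (prompt_ids, response_ids) pairs into one token sequence.

    A request extends the trajectory when its prompt starts with every token seen
    so far. The new prompt suffix (tool output, user turn) gets loss_mask 0 and
    the sampled response gets loss_mask 1.
    """
    tokens, loss_mask = [], []
    for prompt_ids, response_ids in requests:
        if prompt_ids[: len(tokens)] != tokens:
            raise ValueError("prompt does not extend the trajectory; start a new one")
        new_context = prompt_ids[len(tokens):]
        tokens = tokens + new_context + response_ids
        loss_mask = loss_mask + [0] * len(new_context) + [1] * len(response_ids)
    return tokens, loss_mask


turn1 = ([1, 2, 3], [10, 11])              # prompt, then a sampled tool call
turn2 = ([1, 2, 3, 10, 11, 50, 51], [12])  # tool result 50, 51, then the answer
tokens, loss_mask = merge_requests([turn1, turn2])
```

Only tokens the model actually sampled carry a loss mask of 1. When a prompt does not extend the running prefix (for example after context compaction), the sketch raises instead of guessing, which corresponds to starting a separate training sample.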
We also released a complete Coding Agent RL example:
🔗 https://t.co/HFRf1RREzP
The example demonstrates an end-to-end pipeline where Claude Code operates inside a real environment, interacts through SGLang endpoints, logs requests via an Anthropic adapter, generates rewards automatically, and converts trajectories into trainable RL data.
🛠️ Maintaining Open Source in the Agent Era
As coding agents improve, software projects may split into two categories:
• Projects that can be rewritten every time a stronger model arrives.
• Projects whose value comes from years of accumulated design decisions, testing, edge cases, and user trust.
Training frameworks belong to the second category.
That creates two major risks when relying heavily on coding agents:
• Attention DDoS — code volume grows faster than maintainers can review and understand it.
• Loss of ownership — developers stop understanding why systems are designed the way they are, and architecture quality gradually degrades.
Because of this, slime remains conservative in core development. AI is used as a collaborator, reviewer, and coding assistant—not as the primary architect.
On the other hand, we've aggressively used AI for testing and visualization. Over the past few months, this approach has helped us build extensive CPU-only test coverage and improve framework stability.
The goal is simple: make slime not only battle-tested at scale, but also one of the most rigorously tested open-source RL frameworks available.
slime is approaching its first open-source anniversary. What started as a project maintained by one or two people has grown into a team effort.
We hope v0.3.0 makes Agent RL easier to build—and helps slime remain clear, lightweight, and reliable as the Agent era unfolds.
⭐ If slime has been useful to you, consider giving it a star:
https://t.co/p4yr6bWFXU
🔗 Original article: https://t.co/6y7xETbaA6
#AI #Agents #RLHF #ReinforcementLearning #OpenSource #LLM #AgenticAI #SGLang #DeepLearning
Modal put it clearly: frontier RL is no longer just about algorithms — it is an infrastructure problem.
Happy to see slime used in Modal’s RL stack, and even happier to see real upstream contributions coming back to the open-source ecosystem.
The RL infra stack is still early. Let’s build it together!
Reinforcement learning has exploded on Modal, and we've been cooking.
Here's a review of lessons learned helping teams train at scale, the patterns we kept seeing, and an open-source library to get started with RL on Modal quickly.
At @modal, we're working to make sure OSS RL frameworks have all the techniques necessary to train frontier open-weights models.
Delta compression is key, but the job's not done. There are still lots of open problems around weight sync, auto-scaling, & cross-cluster training.
My DMs are open!
@FireworksAI_HQ + @cursor_ai highlighted why delta-compressed weight sync matters for RL at frontier scale.
slime brings this capability to OSS: lossless delta sync for Megatron ↔ SGLang disaggregation — ship deltas, not full checkpoints.
This is another step toward a fully open-source stack where rollout/inference and training are truly decoupled and deployed separately.
PR: https://t.co/OoFR2VJVn1
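For intuition, a lossless delta can be sketched as XOR plus compression over raw weight bytes. This is an illustrative stand-in under the assumption that an update leaves many bytes unchanged. It is not the encoding used in the PR, and every function name below is hypothetical:

```python
import struct
import zlib


def pack(weights):
    """Serialize floats to raw little-endian fp32 bytes (stand-in for a tensor)."""
    return struct.pack(f"<{len(weights)}f", *weights)


def make_delta(old: bytes, new: bytes) -> bytes:
    """XOR the buffers, then compress: unchanged bytes become runs of zeros."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same size")
    return zlib.compress(bytes(a ^ b for a, b in zip(old, new)))


def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new buffer bit-for-bit on the inference side."""
    xor = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(old, xor))


old_weights = [0.1 * i for i in range(10_000)]
new_weights = list(old_weights)
new_weights[42] += 1e-3  # toy update: a single element changes

old, new = pack(old_weights), pack(new_weights)
delta = make_delta(old, new)
restored = apply_delta(old, delta)
print(f"full: {len(new)} bytes, delta: {len(delta)} bytes")
```

Because XOR is its own inverse, the receiver reconstructs exactly the trainer's bytes, which is what makes the scheme lossless. How much it saves depends entirely on how sparse the real update is.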
A couple of updates for Qwen 3.6 users in slime 🚀
• Merged FlashQLA support via PR #1947
→ Enable with --qwen-gdn-backend flashqla to use the official FlashQLA kernels and get the corresponding performance improvements.
• During the review and validation process, we uncovered an issue introduced during the SGLang 0.5.12.post1 upgrade that could affect some PD disaggregation deployments. The issue has now been fixed.
If you’ve recently run into PD-related issues, please pull the latest slime image and give it a try.
A nice example of how community contributions not only bring new features, but also help improve the robustness of the ecosystem ❤️
Huge thanks to everyone involved in the contribution and review!
PR: https://t.co/BXaQBTtUAk
Strongly agree.
As RL shifts from single-turn reasoning to long-horizon agents, rollout correctness becomes increasingly important.
A subtle re-tokenization mismatch can silently disconnect sampled trajectories from optimization targets, leading to gradients on sequences the model never actually generated.
This is why slime's multi-turn and agentic RL examples emphasize preserving sampled tokens across turns rather than reconstructing trajectories from rendered messages, including in black-box agent environments:
https://t.co/lfFriW3Oad
Not an implementation detail — a learning correctness issue.
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.
Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.
The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't, and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.
The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.
Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥
https://t.co/zmx0EQl3jM
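The round-trip trap is easy to reproduce with a toy tokenizer. The vocabulary and greedy encoder below are hypothetical and stand in for a real BPE tokenizer:

```python
VOCAB = {"a": 0, "b": 1, "ab": 2, " ": 3}  # toy vocabulary (hypothetical)
ID_TO_TEXT = {i: t for t, i in VOCAB.items()}


def decode(ids):
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenizer: prefers the merged token 'ab'."""
    ids, pos = [], 0
    while pos < len(text):
        for length in (2, 1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


sampled = [0, 1]                      # the model sampled "a" then "b"
round_trip = encode(decode(sampled))  # same text, but a different token sequence

# Token-in, token-out: keep the sampled ids and append new context to one buffer.
buffer = list(sampled)
buffer += encode(" b")                # tool output is encoded once, then appended
```

Both sequences decode to the same string, so nothing looks wrong at the text level. The mismatch only shows up in the token ids that the log-probabilities are computed on.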
slime was built for agentic RL from day 0.
We added an Agentic RL Training Roadmap that brings together the pieces already in slime for agent workflows: custom generation, verifier/test-based rewards, fan-out samples, async rollout, SGLang serving optimization, and coding-agent RL examples.
If you are training agents with tool use, sandbox interaction, subagents, context compaction, or test-based rewards, this is a good place to start.
Great to see 🌟Polar🌟 using slime as the demo training framework!
This is exactly why we open-source slime: to enable composable RL infrastructure where new rollout systems, real-world harnesses, and scalable training frameworks can work together seamlessly.
Excited for Polar × slime and the broader agentic RL ecosystem 🚀
Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones 🔥 -- Polar takes your harnesses directly as training environments without any code changes.
Find a problem, design the harness, and train your own agents! 🧵
Today, we are thrilled to officially launch RadixArk with $100M in Seed funding at a $400M valuation. The round was led by @Accel and co-led by @sparkcapital.
RadixArk exists to make frontier AI infrastructure open and accessible to everyone. Today, the systems behind the most capable AI models are concentrated in a small number of companies. As a result, most AI teams are forced to rebuild training and inference stacks from scratch, duplicating the same infrastructure work instead of focusing on new models, products, and ideas.
RadixArk was founded to change that. We are building an AI platform that makes it easier for teams to train and serve the best models at scale.
RadixArk comes from the open-source community. We started with SGLang, where many of us are core developers and maintainers, and expanded our work to Miles for large-scale RL and post-training. We will continue contributing to both projects and working with the community to make them the strongest open-source infrastructure foundations for frontier AI.
We would like to thank our long-term partners, contributors, and the broader SGLang community for believing in this mission. We're also grateful to @Accel and @sparkcapital, NVentures (the venture capital arm of @nvidia), Salience Capital, A&E Investment, @HOFCapital, @walden_catalyst, @AMD, LDVP, WTT Fubon Family, @MediaTek, Vocal Ventures, @Sky9Capital, and our angel investors @ibab, @LipBuTan1, Hock Tan, @johnschulman2, @soumithchintala, @lilianweng, @oliveur, @Thom_Wolf, @LiamFedus, @robertnishihara, @ericzelikman, @OfficialLoganK, and @multiply_matrix, among others.
Thanks to @MeghanBobrowsky at @WSJ for the exclusive interview about our vision.
slime environment & CI update:
- SGLang upgraded to 0.5.10.post1
- Megatron updated to dev commit 1dcf0dafa884, following the radixark/miles setup
- Part of Qwen2.5/3 CI replaced with Qwen3.5/3.6
- Added CI for PD disaggregation
- Added CI for GLM-4.7-Flash with transformers 5
This update improves compatibility with newer model and dependency stacks.
Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain.
https://t.co/81QCQw941P
In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement.
We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.
Training DeepSeek V4 @deepseek_ai at scale? SGLang + Miles is the Day 0 path. @lmsysorg
Miles and SGLang enable full-parameter RL training for DSV4 with stability, efficiency, and broad hardware support.
✅ Verified stability
- Rollout Routing Replay (R3) and indexer replay (experimental)
- Tensor-level validation across the Miles & Megatron mixed-precision training stack
- Step-0 train-inference diff: ~0.02–0.03
✅ Efficient full-parameter RL
- DP / TP / SP / EP / PP / CP support
- Tilelang attention and indexer kernels
- FP8/BF16 rollout and FP8/BF16 training support
✅ Broad hardware support
- Verified training on NVIDIA Hopper and Grace Blackwell clusters
- Ready for DeepSeek V4 RL from Day 0
This is the exclusive Day 0 path to scale DeepSeek V4 with rock-solid reliability. Full technical docs & setup guide below! 👇
#DeepSeekV4 #SGLang #RL