MAI is a really cool team of kind, highly motivated and skilled people.
Our team worked with them in the final stretch of this model, contributing some of our SWE 🧙‍♀️
Proud of our Froggy team 🐸 and expect further cool updates from us...
One of the major hurdles I found is that KL(posterior ‖ prior) is not quite as meaningful as in the standard variational parametrizations with continuous latents, since a large amount of info about the target can be sneaked through a single token (e.g. the answer is 'X'). These problems are also evident in recent on-policy self-distillation papers, where the posterior can sneak in info about targets. I think the need for an accurate posterior is more pressing when dealing with CoTs.
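A toy illustration of the single-token leak, under assumptions that are mine and not from any specific paper: a latent of `T` tokens, a uniform prior over `K` symbols at each position, and a posterior that factorizes per token and matches the prior everywhere except the last token, where it puts all its mass on the answer. The names `categorical_kl`, `K`, and `T` are hypothetical.

```python
import math

def categorical_kl(q, p):
    """KL(q || p) in nats for two categorical distributions over the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

K = 4    # assumed number of possible answers / vocabulary size
T = 16   # assumed latent length in tokens

prior_step = [1.0 / K] * K

# Posterior copies the prior on the first T-1 tokens, then states the answer outright.
leaky_posterior = [prior_step] * (T - 1) + [[1.0, 0.0, 0.0, 0.0]]

per_token_kl = [categorical_kl(q, prior_step) for q in leaky_posterior]
total_kl = sum(per_token_kl)

print(per_token_kl[-1], total_kl)  # the whole KL sits on the final token
```

Here the total KL is only `log K` nats, yet that is exactly the entropy of a uniformly distributed answer, so the latent reveals the target completely while the other `T - 1` tokens carry nothing.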
Incredible work by the MAI team, building a model **without any distillation**.
Also shoutout to the Froggy Team 🐸 for their part in the RL data env creation. Go BugPilot!
Excited to partner with @Microsoft to enable everyone in the enterprise to build and deploy safe & secure Fabric data apps.
This is possible thanks to Microsoft's new Rayfin SDK.
🚀 slime v0.3.0 is out!
This release is a major step toward agent-first RL.
We turned slime’s existing multi-turn / agentic capabilities into a more coherent foundation:
- slime/agent with reusable sandbox-agent components
- OpenAI / Anthropic-compatible adapters
- black-box coding-agent RL example
- variable global batch-size training
- fully async training as a first-class path
- lower host-memory usage for more flexible rollout-inference setups
- PPO refactor with actor-critic colocation
- delta weight sync, FlashQLA for Qwen GDN, --save-hf, and more CI coverage
slime is moving closer to a practical open-source framework for large-scale agentic RL.
Release note:
https://t.co/e1ONv8Q4aW
@lateinteraction @novasarc01 are you optimizing the feedback model with the same objective (the one that gives c)? When working on DLN, it was practically very hard to shape p(h|x,c) when c was already too strong.
when working on VinePPO, we wondered whether GRPO could have an implicit step-wise credit assignment just due to the similarity between positive and negative trajectories, definitely on my reading list!
Critic-free RL (e.g. GRPO) is very effective in LLM post-training, but why?
We propose the💥cancellation hypothesis💥: sequence-level rewards implicitly assign credits to individual tokens through the cancellation of gradients from pos/neg rollouts.
https://t.co/TsjWQN5mrD
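A toy sketch of the intuition only, not the paper's analysis: sum mean-centered sequence-level advantages over every (prefix, token) pair a rollout visits, as a crude proxy for the net push on that token. The function `token_credit` is hypothetical, and the advantages here skip the standard-deviation normalization GRPO normally applies.

```python
from collections import defaultdict

def token_credit(rollouts, rewards):
    """Sum group-centered advantages over each (prefix, token) pair seen in the rollouts."""
    mean_reward = sum(rewards) / len(rewards)
    advantages = [r - mean_reward for r in rewards]
    credit = defaultdict(float)
    for tokens, adv in zip(rollouts, advantages):
        for t, tok in enumerate(tokens):
            # Key on the full prefix so only truly shared decisions are merged.
            credit[(tuple(tokens[:t]), tok)] += adv
    return dict(credit)

# One positive and one negative rollout that share the first two tokens.
rollouts = [["a", "b", "c"], ["a", "b", "d"]]
rewards = [1.0, 0.0]
credit = token_credit(rollouts, rewards)

for key, value in credit.items():
    print(key, value)
```

The shared tokens `a` and `b` end up with zero net credit, while the diverging tokens `c` and `d` get +0.5 and -0.5, even though the reward was only given per sequence.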
Launching Agentick 🤖🧠
A unified benchmark for training and evaluating general sequential decision-making agents.
RL agents, LLMs, VLMs, hybrids, bots, and humans can all be evaluated on:
same tasks. same seeds. same score.
First result: no single agent dominates.
🧵
Markovian Thinker in the wild: Zyphra's ZAYA1-8B scales to >5M thinking tokens inside a 32K context window, reaching 91.9% AIME'25 with 0.7B active params. Bounded reasoning tails decouple thinking depth from attention cost, without any fancy linear attention for thinking.
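A back-of-the-envelope count of why bounding the context matters, assuming "32K" means 32,768 tokens and that each generated token attends to at most one window of tokens. Both functions are hypothetical helpers that count query-key pairs; they are not measurements of ZAYA1-8B.

```python
def full_attention_pairs(n_tokens):
    """Query-key pairs if every token attends causally to all earlier tokens and itself."""
    return n_tokens * (n_tokens + 1) // 2

def bounded_attention_pairs(n_tokens, window):
    """Upper bound on pairs if each token attends to at most `window` tokens."""
    return n_tokens * window

n_thinking = 5_000_000   # thinking tokens quoted in the post
window = 32_768          # assumed size of the 32K context window

full = full_attention_pairs(n_thinking)
bounded = bounded_attention_pairs(n_thinking, window)
print(f"full / bounded = {full / bounded:.1f}")
```

Under these assumptions the unbounded trace costs roughly 76 times more attention pairs, and the gap keeps widening as thinking gets longer: the full count grows quadratically, the bounded one linearly.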
I am moving to @ICComputing at @imperialcollege as an associate professor, where I will be expanding my lab!
I am looking for PhDs and postdocs to join me on my quest to build foundation models with adaptive tokenisation and memory (AToM FMs, funded by @ERC_Research)
Through what mechanisms can reasoning models learn faster by choosing what problems to train on, and what are the limits?
Part I of a new series: "Learning to Reason with Curriculum", where we explore algorithmic principles for overcoming the limitations of pre-trained models and data.
w/ Audrey Huang (@auddery), Miro Dudik (@MiroDudik), Rob Schapire, Dylan Foster (@canondetortugas) and Akshay Krishnamurthy. [1/12]
love this replay buffer paper from Meta:
https://t.co/JysdD9gLIn
"methods like PPO or GRPO typically operate as on-policy as possible, meaning rollouts are generated, used for a single gradient update, and immediately discarded."
this is crazy and we shouldn't do this!
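A minimal sketch of what keeping rollouts around could look like, assuming a FIFO buffer with a staleness cutoff measured in policy versions. `RolloutReplayBuffer` and its parameters are hypothetical and are not the Meta paper's method, which may select and correct for off-policy data quite differently.

```python
import random
from collections import deque

class RolloutReplayBuffer:
    """FIFO store of (rollout, policy_version) pairs that can be sampled repeatedly."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)  # oldest rollouts are evicted first
        self.rng = random.Random(seed)

    def add(self, rollout, policy_version):
        self.items.append((rollout, policy_version))

    def sample(self, batch_size, current_version, max_staleness):
        # Keep only rollouts produced by a recent enough policy.
        fresh = [r for r, v in self.items if current_version - v <= max_staleness]
        return self.rng.sample(fresh, min(batch_size, len(fresh)))

buffer = RolloutReplayBuffer(capacity=3)
for version in range(4):
    buffer.add(f"rollout-{version}", policy_version=version)

batch = buffer.sample(batch_size=10, current_version=3, max_staleness=1)
print(sorted(batch))
```

Each stored rollout can feed several gradient updates until it is evicted or becomes too stale, instead of being thrown away after one.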
interesting paragraph from the incredibly interesting Mythos report!
"Claude Mythos Preview also exhibited several deficits in its research capabilities which hindered its performance, including lack of judgment about the quality of its ideas, insufficient hypothesis testing, and overconfident conclusions. These deficits—combined with time constraints—caused Claude Mythos Preview to fail to rediscover the final insight and complete the full task"
still at the frontier:
- notion of "interestingness" of ideas - key to deep scientific discoveries / beyond local auto-research ideas;
- overconfidence - grounding tokens in actual verification results / uncertainty of the model;
- thorough hypothesis testing.