Spent a month writing CuteDSL kernels for an RL training loop.
The fused decode-attention kernel I wrote benchmarked 2.2x faster than the SDPA path it replaces.
Dropped it into HF generate. The decode step got 3x slower.
[Metrics and Explanation below]
<100% human-generated slop alert>
I don't actually know anything about RL, so inspired by @ericjang11 's podcast with Dwarkesh I was curious to see what it would take to crudely implement a version of PPO for connect 4 (a mathematically solved game but still interesting IMO).
The only rule I gave myself was that I was not allowed to vibe-code, only to "inverse" vibe-code. Meaning I only allowed myself to ask LLMs stack overflow type questions to check my intuition and understanding from reading the PPO paper. I would have to write my own code. In this sense the LLM was prompting me to write code, hence "inverse" vibe-coding.
The shape of these questions looked like e.g., "it doesn't make sense to record an autograd graph during rollout right since we don't actually backprop the rollouts" and "this value is frozen after rollout right?" I used Gemini 3.5 Flash this whole time and it's excellent for questions on well-established research.
For example, some of parts of the original PPO paper would have been tedious to implement for my purposes, so checking with Gemini that e.g., I could ignore the discounting of advantage estimates saved me a lot of time.
I started with a scaled down version that plays tic-tac-toe to sanity check my implementation. Once I got a reliable setup that yielded a model that would tie every game I moved on to connect 4. Currently a convnet with a few hundred thousand parameters trained on 32 million moves can play a casual game of connect 4 well (blocking obvious moves) but is straightforward to defeat for a skilled human.
I was pleasantly surprised by the state of CPU-only "deep" learning. A singlethreaded implementation on my 5600X achieves ~1600 moves/s so a small training run is easily doable in half a day or so.
Overall I'm very bullish on this way of learning things as inverse vibe-coding gives me a very fast feedback loop but doesn't make me feel like I'm outsourcing my thinking.
Introducing Search as Code, our new search architecture for AI agents.
It writes Python that calls our search stack directly, instead of looping through function calls one at a time.
Available in the Perplexity Agent API, and now default in Computer.
https://t.co/ut6GGWQTVO
New blog! Is frontier asynchronous RL solved?
The blog covers Async RL theory and infrastructure, surveying 8 open-weight frontier labs for the algorithmic techniques and systems fixes to handle train-inference mismatch. Also answered: why do current methods still fail at high policy lag? Which methods scale with horizon and compute?
At @modal, we're working to make sure OSS RL frameworks have all the techniques necessary to train frontier open-weights models.
Delta compression is key, but the job's not done. There are still lots of open problems around weight sync, auto-scaling, & cross-cluster training.
My DMs are open!
About a month ago I posted about ongoing work on datacenter scale inference simulation. People seemed to like it so we wrote more about it!
Check out this awesome blog post from the Dynamo team!
This is very intuitive and nicely executed.
System reminder/system prompt distillation is very popular in industry, but mostly done in off policy fashion. In that setup, we write scripts to "rejection sample" out hint leakage.
In the end, it is a comparison between on-policy and off-policy with rejection sampling. Which is better?
Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions
Read the full technical blog: https://t.co/B5tp4tdnim
The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.
Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.
Excited to see some of our OPD research ideas land in the @verl_project ecosystem. 🎉
A few weeks ago, we released our work on On-Policy Distillation (OPD): Rethinking On-Policy Distillation of Large Language Models. In that paper, we introduced several diagnostic signals to understand why OPD succeeds — or mysteriously fails — at the token level.
Today, those diagnostics have been merged into the verl training framework. 🔥
The new metrics track:
🔷 top-k overlap between student and teacher token distributions.
🔷 overlap token advantage during distillation.
🔷 alignment dynamics on high-probability teacher tokens.
These signals came directly from our OPD analysis:
successful distillation is driven by overlapping high-probability token regions, while non-overlapping regions contribute almost no effective optimization signal.
🔥 What’s especially exciting is that this is no longer just a research observation: the same ideas have already been used in MiniCPM5 training.
Research → tooling → real models.
Big thanks to the verl community for the collaboration and merge! 🤝
📄 OPD Paper: https://t.co/KEMpOQ77ev
💻 Code Repo: https://t.co/aVzzlkf8mN
🎯 PR: https://t.co/9OsuFcYxZs