Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!
🌀Diversity Aware RL (DARLING)🌀
📝: https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5
Spent a month writing CuteDSL kernels for an RL training loop.
The fused decode-attention kernel I wrote benchmarked 2.2x faster than the SDPA path it replaces.
Dropped it into HF generate. The decode step got 3x slower.
[Metrics and Explanation below]
If you're serious about RL, you eventually need to get your hands dirty and do the math. Precision isn't an implementation detail, the gradient flow itself depends on it, infinitely more than in supervised training.
Check this out https://t.co/nyBeO4VxtR
and follow @DirhousssiAmine
I'm excited to announce that this Fall I will be joining the Computer Science Department at George Mason University as an Asst. Prof. I'll be expanding my lab and looking for PhD students to work on Multilingual AI problems text, video, and speech.
https://t.co/f6PIESWRp4
@ben_vandurme and I are recruiting multiple postdoc fellows at JHU. We're looking for candidates w/ strong record in language models, reasoning, coding agents, and/or AI for science.
Interested candidates should send their CV and a brief summary of their research interests to [email protected] / [email protected].
Please reshare for visibility. 🙏
🚀 Excited to share that our work, AMA-Bench, has been accepted to #ICML2026!
Most benchmarks test dialogue memory, but real agents learn through continuous environment interactions. We actually found that systems acing dialogue benchmarks completely struggle in true agentic settings! 🤯
To fix this, we introduce AMA-Bench to evaluate long-horizon memory in real applications, plus AMA-Agent—a new system designed to track causality and objective info across long trajectories. 🧠
🌐 Check it out: https://t.co/3y2wyXwVyL
See you at ICML! 🎉
Tbh i’m kinda sick of this academic doomerism vibe consuming all of bay area and the self-aggrandizing pov that frontier labs have. Sure a lot of exciting stuff is happening but we wouldn’t be where we are wo academia & there is sth to be said about the pursuit of curiosity.
New blog!
Covers a lot of papers and methods about recent advances in On policy distillation and On policy self distillation, their wins, their failure modes, and my opinion about the same!
Link below, please do check it out, and RT/QT if you like it:)
i was playing with Codex /goal on some lesser-known open conjectures, mostly 20–50y old. after letting it run autonomously for 8h+, i was already seeing what looked like publishable progress, even if not full resolutions.
weakly held take: people overrate “open for decades” as a proxy for importance.
unsolved ≠ important. a lot of old problems are just boring-but-hard, or maybe hard in the bad way / structurally not that productive. imo the higher-value thing is often accelerating recent research directions where the community actually has live taste / consensus that the topic matters.
these aren’t necessarily “harder” in some intrinsic sense. there are just way fewer participants because the prerequisite stack is brutal, vs more approachable combinatorics / Erdős-style problems. so the marginal AI researcher there may be much higher-value than grinding on random half-century-old open problems.
my stronger take: current models can already push some frontiers forward rapidly 95%-automatically, not “solve smooth 4D Poincaré today,” but real progress.
it’s underpriced because the domain people are conservative or slow to retool around AI, and the AI people mostly don’t know which deep problems exist / matter.
New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels"
This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors.
The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling.
At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths.
Blog link: https://t.co/t29Z6jVF87
Repo: https://t.co/3gsRd25QwL
I also put an extensive list of resources at the end, which I found very useful for interested readers.
Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any please reach out!
1 / xx
💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.
AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: hyperparameter tuning of vLLM/SGLang hyperparameters.
Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.
Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).
One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.
This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!
What are users thinking during their interactions with LLMs?
We introduce ThoughtTrace — the first large-scale dataset that captures what users think during real-world human–AI conversations, not just what they type.
→ 10,174 thought annotations
→ 2,155 multi-turn conversations, 17,058 turns
→ 1,058 users
→ 20 LLMs
These thoughts improve user behavior prediction (+41.7%) and model alignment (+25.6%).
This opens a new paradigm of user-centric LLM research. Full information in the thread 🧶
Read our paper: https://t.co/lRYJvGJ7bb
Check our project website: https://t.co/AupCn1YQOk
The bitter lesson in 26 words:
Don’t be distracted by human knowledge, as AI has been historically.
Instead focus on methods for creating knowledge that scale with computation, like search and learning.
Great question about how to come up with research ideas!
I think that different people have different ways, but here are some of mine:
1. Working with excellent people -- most of the ideas we tackle at CMU originally come from the people I work with, not me
2. Thinking about impact -- I always ask my collaborators "who is your target audience, and what results will they need to see to be convinced that this is promising"
3. Staying close to real-world use -- as you know, I have a startup and many of the ideas that we have for research come from practical problems we need to solve there too
(and sorry about the late reply, I needed to think about this before responding)
Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments. 🐌
🚀 We introduce NanoRollout, a lightweight open infra (900 lines core code) for digital agent rollout at scale, validated with three workloads:
🏋️ Large batchsize (4K) SWE Agent RL -> surpasses DeepSWE-32B
🧪 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
⚡ Fast evaluation on coding/cua/unified agent -> finish
Check our Blog: https://t.co/IBNqqbLqra
Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights.
So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning is LLM weights trained with RL while fast learning is context / prompt (fast weights) optimized with GEPA.
Compared to RL, FST performs better while being more data efficient, adaptable (plasticity), and forgetting less (stays closer to base models).
I think this idea of learning both fast-slow weights would be a good foundation for continual learning.
PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea.
See more details here:
https://t.co/FACsHx7IpK
Full post — inference systems, training recipes, reward design, eval, and a survey of Multiverse, Parallel-R1, NPR, ThreadWeaver + the original APR method (Pan et al., 2025):
https://t.co/iFQwVISoue
Co-authored with @tonylian!
"Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding"
Speculative decoding for RL rollouts!
This paper speeds up post-training without changing the target policy’s sampling distribution. So a draft model proposes multiple tokens, and the policy model verifies them.
An important system piece is that the draft must stay aligned with the continually updated policy, with weight sync and optional online adaptation.
This gives faster rollouts, but same learning trajectory. With ~1.5-1.8x rollout speedup at 8B, and projected ~2.5x end-to-end speedup at 235B scale.
Today, we’re releasing Continual Learning Bench 1.0: the first, realistic benchmark for measuring how AI systems can improve in online settings.
Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened.
But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and find there’s still plenty of headroom for learning. (1/n)
Banger paper from Meta FAIR.
They introduce Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously.
The headline result: on a CS research QA task, an Agentic Self-Instruct loop produces a 34-point gap between weak and strong solvers (43.7% vs 77.8%), while standard CoT Self-Instruct on the same setup produces a 1.9-point gap (71.4% vs 73.3%).
The agent generates questions that actually discriminate between models.
The method:
An orchestrator LLM directs a challenger agent to generate examples grounded in domain documents. A weak and a strong solver attempt them, a judge scores the outputs, and the orchestrator analyzes the failures and prompts the challenger to regenerate from new angles until quality thresholds are met.
The system also meta-optimizes itself.
An outer loop tunes the agent's instructions based on which harness changes lift validation pass rate. Over 126 accepted iterations, validation pass rate climbed from 12.8% to 42.4%. They processed 10,000+ CS papers and produced 2,117 quality-filtered QA pairs.
Existing self-instruct pipelines do not control data quality. Autodata reframes data generation as an agent loop, spend more inference compute and the data gets harder, which gives downstream RL a real lift.
Blog: https://t.co/41coXidxRI
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c