🤔On-policy (self) distillation has a prefix failure issue: when the student's rollout y_o is wrong (far from the teacher distribution), the teacher’s gradient signal becomes problematic.
🚨Introducing Trajectory-Refined Distillation (TRD), a trajectory-level refinement of y_o to solve the issue.
tldr: OPD/OPSD❌ TRD ���️
Paper: https://t.co/7q7dKMbUB5
🚨 New research work with @CHAI_Berkeley!
We provide the first multi-domain benchmark evaluating safety monitors for OOD misalignment detection by intentionally restricting the training dataset.
Special thanks to folks @MATSprogram and @haizelabs for providing valuable feedback and compute.
AI labs keep pumping in more and more safety training data to prevent safety failures based on observed test time failure modes. however, it's important to catch safety failures that we don't know beforehand ("unknown unknowns")/
cool work on using OOD detectors for catching such safety failures!
As @boazbaraktcs points out, the real challenge in AI safety isn’t the failures we expect—it’s the "unknown unknowns."
In our new paper with @dylanfeng_, @cassidy_laidlaw, and @ancadianadragan, we tackled this exact gap. We built the MOOD benchmark to show that traditional guard models often miss flagging these anomalous behaviours at test-time, but pairing them with robust OOD detectors (like Mahalanobis distance & perplexity) increases recall drastically—outperforming a standard guard model with 20x more parameters. 📈
Check out Boaz's post for the context on "unknown unknowns": 🔗 https://t.co/lubiwaZ7Tq
Check out our full paper 👇
We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵
If you��re at #ICLR2026, come say hello! 👋
I and @Harman26Singh will be at Pavilion 4 (P4-#4603) tomorrow from 3:15 PM – 5:45 PM discussing our work on Reward Modeling via Causal Rubrics —and the gigantic gains it brings to on-policy RL and TTS! 📈
#ICLR2026 #RL #Rubrics
Liner is partnering with @spoticlr at #ICLR2026 — supporting Best Paper and Travel Awards for LLM research.
And to celebrate, we're giving away:
✈️ Round-trip flights + hotel to #ICML2026 in Seoul
🎁 $300 Liner Credits
Follow @search_liner + repost to enter by 4/27.
Liner is built for research workflows. Find papers, verify sources, and write with citations in one place.
See you in 🇧🇷 and 🇰🇷!
@iclr_conf@icmlconf
Can LLMs Self-Verify? Much better than you'd expect.
LLMs are increasingly used as parallel reasoners, sampling many solutions at once.
Choosing the right answer is the real bottleneck.
We show that pairwise self-verification is a powerful primitive.
Introducing V1, a framework that unifies generation and self-verification:
💡 Pairwise self-verification beats pointwise scoring, improving test-time scaling
💡 V1-Infer: Efficient tournament-style ranking that improves self-verification
💡 V1-PairRL: RL training where generation and verification co-evolve for developing better self-verifiers
🧵👇
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective.
Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings.
Paper + code + project website: https://t.co/j9BCBF7K3R
🧵 1/n
Checkout our latest work: Residual Context Diffusion Language Models (RCD) 🚀
- diffusion LLMs rely on "remasking," where low-confidence tokens are discarded at every step. This wastes valuable intermediate computation. RCD addresses this by recycling discarded tokens.
- We convert these representations into contextual residuals and inject them back into the next denoising step.
📄 Paper: https://t.co/ddeTX9Ehg3
🧵
Exciting to see much-needed progress on evaluating Indic language/culture understanding!
IndicGenBench shared these motivations and is one of the first generative evals for 29 Indic Languages!
https://t.co/hY3tmJez6G
@partha_p_t@nitish_gup
For agents to improve over time, they can’t afford to forget what they’ve already mastered.
We found that supervised fine-tuning forgets more than RL when training on a new task!
Want to find out why? 👇
I have deep respect for students grinding on NeurIPS rebuttal these days:
- running a brutal amount of experiments
- shaping them into a polished narrative
- all under a tight timeline
It’s an art + endurance test.
Awesome work on using checklists for RL! and great line of work coming out in this direction.
Also relevant, we recently found that generating instruction-specific causal rubrics can help create synthetic data for more robust reward model training. (which then helps do better alignment!)
https://t.co/HlWLn4ritQ
@imrahulmaddy@Pragya2k
🚨 Presenting Our Robust Reward Modeling work now at MoFA workshop @#ICML
📍 West Ballroom A , 12 - 1:30 as well as 4-5pm
Also catch me at DataWorld Workshop tomorrow while presenting the same
Exciting new RL tooling: A modular library for RL training by the Berkeley NovaSky team. While standard RL training is all done in one loop, it is more efficient for modern post-training to separate the generation of the rollouts from the trainer. It also enables asynchronous rollout generations or separate containers for the environments. It reminds me of the whole Hogwild! parallel SGD craze we all worked on, in distributed ML days, circa 2011.
Presenting:
Robust Reward Modeling via Causal Rubrics 📑
👉 https://t.co/sBLN2m4Ssx
A major problem for RLHF pipelines is reward hacking — RMs latch onto spurious cues (e.g. length, style) instead of true quality. Can we systematically tackle reward hacking by making reward models robust to spurious cues?
#RLAIF #CausalInference
🧵