Pragya Srivastava

27 days ago

AI labs keep pumping in more and more safety training data to prevent safety failures based on observed test-time failure modes. However, it's important to catch safety failures that we don't know about beforehand ("unknown unknowns"). Cool work on using OOD detectors for catching such safety failures!

27 days ago

As @boazbaraktcs points out, the real challenge in AI safety isn’t the failures we expect—it’s the "unknown unknowns." In our new paper with @dylanfeng_, @cassidy_laidlaw, and @ancadianadragan, we tackled this exact gap. We built the MOOD benchmark to show that traditional guard models often miss flagging these anomalous behaviours at test-time, but pairing them with robust OOD detectors (like Mahalanobis distance & perplexity) increases recall drastically—outperforming a standard guard model with 20x more parameters. 📈 Check out Boaz's post for the context on "unknown unknowns": 🔗 https://t.co/lubiwaZ7Tq Check out our full paper 👇

Cassidy Laidlaw

@cassidy_laidlaw

27 days ago

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

2 months ago

If you're at #ICLR2026, come say hello! 👋 @Harman26Singh and I will be at Pavilion 4 (P4-#4603) tomorrow from 3:15 PM to 5:45 PM discussing our work on Reward Modeling via Causal Rubrics, and the gigantic gains it brings to on-policy RL and TTS! 📈 #ICLR2026 #RL #Rubrics

12 months ago

🚨 New @GoogleDeepMind paper 𝐑𝐨𝐛𝐮𝐬𝐭 𝐑𝐞𝐰𝐚𝐫𝐝 𝐌𝐨𝐝𝐞𝐥𝐢𝐧𝐠 𝐯𝐢𝐚 𝐂𝐚𝐮𝐬𝐚𝐥 𝐑𝐮𝐛𝐫𝐢𝐜𝐬 📑 👉 https://t.co/EwF8HU7CGU We tackle reward hacking, where RMs latch onto spurious cues (e.g. length, style) instead of true quality. #RLAIF #CausalInference 🧵⬇️

Pragya2k retweeted

Liner

@search_liner

3 months ago

Liner is partnering with @spoticlr at #ICLR2026 — supporting Best Paper and Travel Awards for LLM research. And to celebrate, we're giving away: ✈️ Round-trip flights + hotel to #ICML2026 in Seoul 🎁 $300 Liner Credits Follow @search_liner + repost to enter by 4/27. Liner is built for research workflows. Find papers, verify sources, and write with citations in one place. See you in 🇧🇷 and 🇰🇷! @iclr_conf @icmlconf

4 months ago

Pairwise self-verification + co-training the generator and verifier is a clean idea for stronger test-time scaling.

Fahim Tajwar @FahimTajwar10

4 months ago

Can LLMs Self-Verify? Much better than you'd expect. LLMs are increasingly used as parallel reasoners, sampling many solutions at once. Choosing the right answer is the real bottleneck. We show that pairwise self-verification is a powerful primitive. Introducing V1, a framework that unifies generation and self-verification: 💡 Pairwise self-verification beats pointwise scoring, improving test-time scaling 💡 V1-Infer: Efficient tournament-style ranking that improves self-verification 💡 V1-PairRL: RL training where generation and verification co-evolve for developing better self-verifiers 🧵👇

Pragya2k retweeted

5 months ago

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: https://t.co/j9BCBF7K3R 🧵 1/n

Pragya2k retweeted

5 months ago

Check out our latest work: Residual Context Diffusion Language Models (RCD) 🚀 - Diffusion LLMs rely on "remasking," where low-confidence tokens are discarded at every step. This wastes valuable intermediate computation. RCD addresses this by recycling discarded tokens. - We convert these representations into contextual residuals and inject them back into the next denoising step. 📄 Paper: https://t.co/ddeTX9Ehg3 🧵

Pragya2k retweeted

5 months ago

Rubrics can help make reward models more robust. This work will be presented at #ICLR2026 🇧🇷! @Pragya2k @imrahulmaddy

Pragya2k retweeted

8 months ago

Exciting to see much-needed progress on evaluating Indic language/culture understanding! IndicGenBench shared these motivations and is one of the first generative evals for 29 Indic Languages! https://t.co/hY3tmJez6G @partha_p_t @nitish_gup

Pragya2k retweeted

Jyo Pari

@jyo_pari

10 months ago

For agents to improve over time, they can’t afford to forget what they’ve already mastered. We found that supervised fine-tuning forgets more than RL when training on a new task! Want to find out why? 👇

10 months ago

A neat idea for solving reward hacking in Process Reward Models!

Chenlu Ye @ye_chenlu

10 months ago

PROF🌀Right answer, flawed reason?🤔🌀 📄https://t.co/8kFrxKQbVW Excited to share our work: PROF-PRocess cOnsistency Filter! 🚀 Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes strengths of PRM & ORM. #LLM #ReinforcementLearning

Pragya2k retweeted

Sharon Li

@SharonYixuanLi

11 months ago

I have deep respect for students grinding on NeurIPS rebuttal these days: - running a brutal amount of experiments - shaping them into a polished narrative - all under a tight timeline It’s an art + endurance test.

Pragya2k retweeted

11 months ago

Awesome work on using checklists for RL, and a great line of work coming out in this direction! Also relevant: we recently found that generating instruction-specific causal rubrics can help create synthetic data for more robust reward model training (which in turn helps with better alignment!). https://t.co/HlWLn4ritQ @imrahulmaddy @Pragya2k

11 months ago

Catch our Robust Reward Modeling paper at #ICML MoFA today & DataWorld tomorrow!