Max Lamparth @MLamparth - Twitter Profile

Pinned Tweet

7 days ago

New paper: We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. We find it active in GRPO, in SOTA reward models, and published methods.

Max Lamparth @MLamparth

1 day ago

Check it out: https://t.co/Qz3OijSqie "Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure" Joint work with @DanielFein7, @andreas_h0wpt, @marcel_hussing, and @aiprof_mykel

Max Lamparth @MLamparth

1 day ago

Had a great time presenting our paper on Reward Bias Substitution as an oral at #RLEval Workshop at @CAISconf #CAIS2026 last week. Thanks to everyone who came by and asked such thoughtful questions!

Max Lamparth @MLamparth

1 day ago

We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. https://t.co/JCJSQ8TuUF

MLamparth retweeted

Cas (Stephen Casper)

@StephenLCasper

4 days ago

Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do. With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.

Max Lamparth @MLamparth

5 days ago

@boden_moraski @DanielFein7 @andreas_h0wpt @marcel_hussing @aiprof_mykel Thanks! Yes they do :) We cover implicit reward models in A12 and extend the framework to DPO and other direct alignment methods. The sufficiency classifier (Cor A.14) covers any KL-regularized optimum via the policy pair, and Cor A.15 gives an operator-side impossibility for DPO.

Max Lamparth @MLamparth

6 days ago

@houjun_liu Thank you so much for the support, Jack!

MLamparth retweeted

Avanika Narayan

@Avanika15

7 days ago

Honored to be published in @ForeignAffairs. For 2 years we've made the case for local AI. Here's the geopolitical angle: the U.S.–China race isn't about who trains the best model. It's about whose models, chips, and frameworks run by default on billions of devices. Joint work w/@jdunnmon & @JonSaadFalcon (1/N)

MLamparth retweeted

Cassidy Laidlaw

@cassidy_laidlaw

9 days ago

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

Max Lamparth @MLamparth

7 days ago

Check out the paper: Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure Paper: https://t.co/Qz3OijSqie Code: https://t.co/7jxDlfABPu Joint work with @DanielFein7, @andreas_h0wpt, @marcel_hussing, and @aiprof_mykel

Max Lamparth @MLamparth

7 days ago

Across 25 published RLHF + DPO mitigations we surveyed, none provide the evidence the framework requires to certify they actually removed a bias rather than relocated it. Substitution is already widespread, routinely misclassified as anomaly, capability trade-off, or judge noise.
