Sanjit Dandapanthula @sanjitdp - Twitter Profile

Pinned Tweet

about 21 hours ago

super excited to share our latest work! are we really tilting? 🤨

tldr: reward guidance for flows and diffusions is supposed to sample from the reward-tilted distribution. we show it doesn’t 😰 and how to (mostly) fix it ✨ plus lots of fun images!! 🖼️

collaboration with the awesome @nmboffi

website: https://t.co/nvOaAiGYq1
paper: https://t.co/EtkeyiuX7s
code: https://t.co/V3Bi4IVPbf

Sanjit Dandapanthula

@sanjitdp

about 14 hours ago

@FEijkelboom thanks!

sanjitdp retweeted

Nicholas Boffi

@nmboffi

about 21 hours ago

reward guidance is one of the most useful tricks we have for steering diffusions and flows at inference time. you take a learned generative process and nudge it toward high-reward samples, and in principle you're sampling from the reward-tilted measure. but are we really tilting? 🤔

in practice, people use approximations, and the best one we have is a finite-particle plug-in estimate of the value function. it turns out that this approximation doesn't sample the tilted measure at all. we show that the bias induced causes reward hacking completely independent from the reward model -- this even occurs for gaussians with quadratic rewards!

once you identify this problem, it turns out you can fix it analytically. @sanjitdp derives a closed-form reward damping schedule that corrects the within-mode bias at zero extra compute. it improves guidance at any number of particles, and with just a single particle it matches what used to take eight to sixteen, giving an order of magnitude reduction in compute.

he also shows that simple plug-in estimators can't select modes in the tilted distribution. best-of-n can, and he highlights that best-of-n + damping gives the best performance in practice as a result.

amazingly, the theory carries all the way from gaussian mixtures up to FLUX.1 text-to-image generation, leading to significant performance improvements at a fraction of the cost 🚀
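
A minimal sketch of why a finite-particle plug-in estimate can be biased, in the Gaussian-with-quadratic-reward setting the tweet mentions. This is not the authors' code and does not implement their damping schedule. It assumes a 1-D base sample x ~ N(0, 1), a reward r(x) = -(lam/2)(x - a)^2 with hypothetical parameters `lam` and `a`, and compares the k-particle estimate log((1/k) Σ exp(r(x_i))) against the closed-form log E[exp(r(x))]. The function names are made up for illustration.

```python
import numpy as np

def true_log_value(lam: float, a: float) -> float:
    """Closed-form log E[exp(r(x))] for x ~ N(0, 1), r(x) = -(lam/2) * (x - a)**2."""
    return -0.5 * np.log1p(lam) - lam * a**2 / (2.0 * (1.0 + lam))

def plugin_log_value(k: int, lam: float, a: float, trials: int, rng) -> float:
    """Mean over `trials` of the k-particle plug-in estimate log((1/k) * sum_i exp(r(x_i)))."""
    x = rng.standard_normal((trials, k))
    r = -0.5 * lam * (x - a) ** 2
    # log-mean-exp per row, stabilised by subtracting the row maximum
    m = r.max(axis=1, keepdims=True)
    log_mean_exp = m[:, 0] + np.log(np.exp(r - m).mean(axis=1))
    return float(log_mean_exp.mean())

rng = np.random.default_rng(0)
lam, a = 1.0, 2.0  # hypothetical toy parameters
exact = true_log_value(lam, a)
for k in (1, 8, 64):
    est = plugin_log_value(k, lam, a, trials=20_000, rng=rng)
    print(f"k={k:3d}  plug-in mean={est:+.3f}  exact={exact:+.3f}")
```

By Jensen's inequality the plug-in estimate sits below the exact value on average, and the gap shrinks as k grows. The sketch only looks at the value estimate itself; how that bias propagates through guided sampling is what the linked paper analyzes.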

Sanjit Dandapanthula

@sanjitdp

about 21 hours ago

tagging a few people who might be interested! @YuanqiD @cdomingoenrich @msalbergo @max_simchowitz @karsten_kreis @peholderrieth @electronickale @bose_joey @AlexanderTong7 @ArashVahdat @k_neklyudov @DavidSHolz @ValentinDeBort1 @baaadas @JiequnH @du_yilun @ZhengyangGeng @sitanch @sedielem @gottapatchemall @FEijkelboom @osclsd @RickyTQChen @lipmanya

Sanjit Dandapanthula

@sanjitdp

about 21 hours ago

[5/5] recap: plug-in guidance reward hacks within modes and cannot select high-reward modes. we propose reward damping and best-of-n as partial fixes to these issues respectively. we conclude with image experiments that verify our theory at scale 🚀
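
A toy illustration of the best-of-n idea in the recap, under loud assumptions: a hypothetical 1-D two-mode Gaussian mixture as the base sampler and a reward equal to x, so the right-hand mode is the high-reward one. It shows only plain best-of-n selection on final samples; the thread does not spell out how the authors combine it with guidance or damping, so none of that is modeled here.

```python
import numpy as np

def sample_mixture(n: int, rng) -> np.ndarray:
    """Draw n samples from an equal-weight mixture of N(-2, 0.5^2) and N(+2, 0.5^2)."""
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + 0.5 * rng.standard_normal(n)

def reward(x: np.ndarray) -> np.ndarray:
    """Toy reward (an assumption) that prefers the right-hand mode."""
    return x

def best_of_n(n: int, rng) -> float:
    """Draw n base samples and keep the one with the highest reward."""
    x = sample_mixture(n, rng)
    return float(x[np.argmax(reward(x))])

rng = np.random.default_rng(0)
base = sample_mixture(2000, rng)
picked = np.array([best_of_n(8, rng) for _ in range(2000)])
print("fraction in right mode, base:     ", np.mean(base > 0))
print("fraction in right mode, best-of-8:", np.mean(picked > 0))
```

The base sampler lands in each mode about half the time, while best-of-8 lands in the left mode only when all eight draws do, which is why selection across modes works here.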
