super excited to share our latest work! are we really tilting? 🤨
tldr: reward guidance for flows and diffusions is supposed to sample from the reward-tilted distribution. we show it doesn't 😰 and how to (mostly) fix it ✨
plus lots of fun images!! 🖼️
collaboration with the awesome @nmboffi
website: https://t.co/nvOaAiGYq1
paper: https://t.co/EtkeyiuX7s
code: https://t.co/V3Bi4IVPbf
reward guidance is one of the most useful tricks we have for steering diffusions and flows at inference time. you take a learned generative process and nudge it toward high-reward samples, and in principle you're sampling from the reward-tilted measure.
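for reference, a sketch of what "reward-tilted" usually means. the notation is an assumption, not taken from the paper: $p$ is the base model's density, $r$ the reward, and $\beta > 0$ a temperature that the paper may parameterize differently.

```latex
\tilde{p}(x) \;=\; \frac{p(x)\,\exp\!\big(r(x)/\beta\big)}{Z},
\qquad
Z \;=\; \int p(x)\,\exp\!\big(r(x)/\beta\big)\,\mathrm{d}x .
```

high-reward regions get upweighted relative to $p$, and sending $\beta \to \infty$ recovers the base model.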
but are we really tilting?๐ค
in practice, people use approximations, and the best one we have is a finite-particle plug-in estimate of the value function. it turns out that this approximation doesn't sample the tilted measure at all.
we show that the induced bias causes reward hacking completely independent of the reward model -- this even occurs for gaussians with quadratic rewards!
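a minimal numpy sketch of one generic ingredient here, under loud assumptions: it only checks that a k-particle log-mean-exp estimate of log E[exp(r(X))] is biased low for a gaussian X and a quadratic reward (jensen's inequality). the setup (`m`, `s`, `lam`, the reward `r(x) = -lam * x^2 / 2`) is hypothetical and chosen so the exact value has a closed form. it is not the paper's estimator or analysis, and it does not simulate guided sampling.

```python
import numpy as np

def true_log_value(m, s, lam):
    """Closed-form log E[exp(r(X))] for X ~ N(m, s^2), r(x) = -lam * x^2 / 2."""
    return -0.5 * np.log1p(lam * s**2) - lam * m**2 / (2.0 * (1.0 + lam * s**2))

def plugin_log_value(m, s, lam, k, trials, rng):
    """Mean over `trials` of the k-particle estimate log(mean_k exp(r(x_k)))."""
    x = rng.normal(m, s, size=(trials, k))
    r = -0.5 * lam * x**2
    # log-mean-exp over the k particles, stabilised by subtracting the row max
    r_max = r.max(axis=1, keepdims=True)
    log_mean = r_max[:, 0] + np.log(np.exp(r - r_max).mean(axis=1))
    return log_mean.mean()

rng = np.random.default_rng(0)
m, s, lam = 1.0, 1.0, 1.0  # hypothetical toy parameters
exact = true_log_value(m, s, lam)
est = {k: plugin_log_value(m, s, lam, k, 100_000, rng) for k in (1, 4, 16)}

print(f"exact: {exact:.4f}")
for k, v in est.items():
    print(f"k={k:2d}: {v:.4f}  (bias {v - exact:+.4f})")
```

the gap shrinks as k grows but never changes sign, which is the usual behaviour of a log of a finite average.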
once you identify this problem, it turns out you can fix it analytically. @sanjitdp derives a closed-form reward damping schedule that corrects the within-mode bias at zero extra compute. it improves guidance at any number of particles, and with just a single particle it matches what used to take eight to sixteen, giving an order of magnitude reduction in compute.
he also shows that simple plug-in estimators can't select modes in the tilted distribution. best-of-n can, and he highlights that best-of-n + damping gives the best performance in practice as a result.
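a toy sketch of plain best-of-n on final samples, to make the mode-selection point concrete. everything here is an assumption for illustration: the two-mode gaussian mixture `sample_mixture`, the quadratic `reward`, and n = 8 are all hypothetical, and the thread doesn't describe how the paper combines best-of-n with damping.

```python
import numpy as np

def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n candidates from the base model and keep the highest-reward one."""
    xs = sample_fn(n, rng)
    return xs[np.argmax(reward_fn(xs))]

def sample_mixture(n, rng):
    # hypothetical base model: equal-weight mixture of N(-2, 0.5^2) and N(+2, 0.5^2)
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + 0.5 * rng.normal(size=n)

def reward(x):
    # hypothetical reward that prefers the mode near +2
    return -0.5 * (x - 2.0) ** 2

rng = np.random.default_rng(0)
picks = np.array([best_of_n(sample_mixture, reward, 8, rng) for _ in range(2000)])
frac_right = float(np.mean(picks > 0))
print(f"fraction of picks in the high-reward mode: {frac_right:.3f}")
```

the base model puts half its mass on each mode, while nearly all best-of-8 picks land in the high-reward one, since a pick only misses when all eight candidates come from the low-reward mode.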
amazingly, the theory carries all the way from gaussian mixtures up to FLUX.1 text-to-image generation, leading to significant performance improvements at a fraction of the cost
[5/5] recap: plug-in guidance reward hacks within modes and cannot select high-reward modes. we propose reward damping and best-of-n as partial fixes to these issues respectively. we conclude with image experiments that verify our theory at scale