Diffusion models treat every part of an image equally.
→ Same number of steps. Same compute.
But images aren’t uniform. 🤔
Some regions are easy, others are hard.
So why force the model to treat them the same? 🧵
⚠️ Standard first stages are not sufficient for safety-critical applications!
The most extreme weather events are often the hardest to decode.
One latent → many plausible reconstructions
Deterministic decoders hide that uncertainty.
Meet FREUD 🧵👇
Diffusion models treat every part of an image equally.
→ Same number of steps. Same compute.
But images aren’t uniform. 🤔
Some regions are easy, others are hard.
So why force the model to treat them the same? 🧵
⚠️ Standard first stages are not sufficient for safety-critical applications!
The most extreme weather events are often the hardest to decode.
One latent → many plausible reconstructions
Deterministic decoders hide that uncertainty.
Meet FREUD 🧵👇
The internet is full of video. So why can't novel view synthesis just scale on it?
Real-world video is simultaneously unposed, messy, and dynamic, breaking self-supervised NVS.
We fixed that. RayDer learns static-scene NVS from dynamic internet video, scaling like an LLM. A🧵
Diffusion models treat every part of an image equally.
→ Same number of steps. Same compute.
But images aren’t uniform. 🤔
Some regions are easy, others are hard.
So why force the model to treat them the same? 🧵
@JiaweiYang118 We mainly focus on improving generation with a given budget. But we also show empirically that we can a) achieve better performance with better allocated compute (lower NFE) and b) are orthogonal to caching
Takeaway 🚀
• Diffusion shouldn’t treat all regions equally
• Patch-wise timesteps improve performance, if done right
• Allocating compute where it matters gives further gains
Project Page: https://t.co/CNFqzK7DoG
arXiv: https://t.co/s56wDM5pSx
Code: https://t.co/uos07FxTSd