Thanks to @_akhaliq for featuring our work! InstructVideo addresses key challenges in video generation by integrating human feedback into video diffusion models. Excited to see how InstructVideo advances AI-driven video creation! 🚀 #AI #VideoGeneration #InstructVideo
Alibaba announces InstructVideo: Instructing Video Diffusion Models with Human Feedback
paper page: https://t.co/6mg9XheORk
Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities.
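The two reward mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: picking the middle frame of each segment, the exponential decay around the central sample, and the names `segment_frame_indices`, `attenuation_weights`, `video_reward`, and `image_reward_fn` (a stand-in for an image reward model such as HPSv2) are all assumptions made for illustration.

```python
import math


def segment_frame_indices(num_frames, num_segments):
    """Split the video into equal segments and pick each segment's middle frame.

    Assumption: the abstract only says "segmental sparse sampling"; choosing
    the middle frame of each segment is one simple way to do it.
    """
    if not 0 < num_segments <= num_frames:
        raise ValueError("num_segments must be in 1..num_frames")
    seg_len = num_frames / num_segments
    return [int(seg_len * (i + 0.5)) for i in range(num_segments)]


def attenuation_weights(num_samples, decay=1.0):
    """Weights that peak at the central sample and decay toward both ends.

    Assumption: an exponential decay, normalized to sum to 1. The paper's
    exact attenuation scheme may differ.
    """
    center = (num_samples - 1) / 2
    raw = [math.exp(-decay * abs(i - center)) for i in range(num_samples)]
    total = sum(raw)
    return [w / total for w in raw]


def video_reward(frames, prompt, image_reward_fn, num_segments=4, decay=1.0):
    """Weighted average of per-frame image rewards over sparsely sampled frames.

    `image_reward_fn(frame, prompt)` is a hypothetical callable returning a
    float score for one frame.
    """
    indices = segment_frame_indices(len(frames), num_segments)
    weights = attenuation_weights(len(indices), decay)
    return sum(w * image_reward_fn(frames[i], prompt)
               for w, i in zip(weights, indices))
```

Scoring only a handful of frames keeps the per-step cost of the image reward model low, and the weighting lets some sampled frames contribute more to the reward signal than others.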
🌟UniLumos — a unified framework for image & video relighting with physics-plausible feedback!
UniLumos learns lighting consistency in static & dynamic scenes — much faster and more physically grounded⚡
💻 Code: https://t.co/IMNO4BVkzb
Also in ComfyUI-WanVideo!
#neurips2025
Unleash the resolution of your SDXL at no extra cost. 🚀FreeScale🚀, a tuning-free method for higher-resolution visual generation, unlocks 8K image generation! #FreeScale #SDXL
- Project: https://t.co/cdkjJU77J0
- Code: https://t.co/vnyE3zOvP0
- Paper: https://t.co/naKY9gmiho
Throughout my PhD, I've found one basic trick to read papers in less than 30 minutes but with maximum utility. It boils down to consuming actively, not passively: 🧵 1/5
Since joining industry, I have realized more and more how fragile and infeasible it is to build your business on proprietary LLMs. An 86% open-weight model >> an 89% proprietary API. Open source is the future!
I will be presenting InstructVideo on June 19th from 17:15 to 18:45 at Arch 4A-E (poster 162). Feel free to reach out! I am more than happy to discuss it. 🥳
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
https://t.co/YYpOAcrXQ3
Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”
🎯Align GenAI with Human Preference🎯
#InstructVideo instructs video diffusion models with human feedback by reward fine-tuning, enhancing the video generation quality/aesthetics
- Project: https://t.co/1vILMrZEkI
- Paper: https://t.co/vYO9wnMIf5
- Code: https://t.co/UcDH4OmTu9
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
paper page: https://t.co/GjisYPlDr0
Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory in the challenging video generation task, as it requires the controllability of both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images, which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation.
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
paper page: https://t.co/zj3ip5Y8My
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data.
Dear #ICCV2023 attendees, my laptop was stolen from my backpack on Monday (02.10) from room S01, and another laptop was stolen from S06. Both laptops were taken between 2 and 3 pm. If you have any photos/videos from these rooms at those times, I would appreciate it if you shared them with me.