How do we generate videos on the scale of minutes, without drifting or forgetting about the historical context?
We introduce Mixture of Contexts. Every minute-long video below is the direct output of our model in a single pass, with no post-processing, stitching, or editing.
1/4
Glad to share Seaweed-7B, a cost-effective foundation model for video generation. Our tech report highlights the key designs that significantly improve compute efficiency and performance given limited resources, achieving quality comparable to other industry-level models. To unleash the power of the foundation model, Seaweed-7B further enables a wide range of downstream applications, including image-to-video generation, human video generation, subject-consistent video generation, video-audio joint generation, long video generation and storytelling, real-time generation, super-resolution generation, and camera-controlled generation.
Check out our webpage and report for more details:
Webpage: https://t.co/5s9Af4FQCb
Paper: https://t.co/GHVs4cvELt
It has been a wonderful journey over the past year. Sincere thanks to all teammates for their contributions.
@rtk254 Ronen, interesting discussion! We have a recent work showing that training on synthetically generated CGI videos can indeed help models learn to generate videos that better respect physical constraints: https://t.co/HmOmX3uMEP
@dreamingtulpa Thanks for covering our work and for the discussion. As mentioned in the paper's abstract: while the model still lacks a deep understanding of physics, it offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis.
We propose Long Context Tuning (LCT) for scene-level video generation to bridge the gap between current single-shot generation and real-world narrative video productions.
Homepage: https://t.co/1kA5LrNY8W
Report: https://t.co/8GF2hTSOXn
Want the deep dive?
• arXiv: https://t.co/2HLdzMyDEH
• Project Page: https://t.co/AUhbJmrwem
See how VideoAuteur + CookGen are shaping long narrative video generation.
Big shout-out to my co-authors and advisors: @fncheng2333 @liangkegui @YuilleAlan @roadjiang
Seaweed APT
Diffusion Adversarial Post-Training for One-Step Video Generation
Existing diffusion and autoregressive generative models require repeated neural network evaluations. This is extremely slow for high-resolution video generation, where a few-second video can take many minutes to generate. Our work is the first to demonstrate the generation of an entire video using a single neural function evaluation (1NFE), enabled by our proposed adversarial post-training technique. Our model generates 2-second, 1280x720, 24 fps videos in real time. We showcase some of the results below:
Interesting comparison between our VideoPoet and other competitive models.
The comparison is incredibly helpful and reinforces my belief that VideoPoet excels in generating larger motions. We know the exact reasons for this and are working on improving single frame quality.
Google VideoPoet, Runway, Pika & Genmo
Google recently announced VideoPoet.
Google's VideoPoet is a large language model (LLM) that is capable of a wide variety of video generation tasks, including:
- text-to-video
- image-to-video
- video stylization
- video inpainting and outpainting
- video-to-audio.
I tried some of their text-to-image prompts (from their demo) in Pika, Runway and Genmo. Here are the results:
10 examples
1/10
Two teddy bears holding hands, walking down rainy 5th avenue.
@anuaakash VideoPoet co-author here. Thanks a ton! Due to policy constraints, we weren't able to perform such comparisons. Your analysis is incredibly helpful and reinforces my belief that VideoPoet excels in creating larger motions. Its per-frame quality can be further improved.
Excited to be at #NeurIPS2023 this week! Can't wait to reconnect with colleagues and make new connections. If you're up for a coffee chat, feel free to reach out.
Find me at our spotlight/posters.
https://t.co/QHpEz66JyP
Tue 12 5:15 p.m.
https://t.co/mFX52fktOm
Wed 13 10:45 a.m.
We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇
😲 While preparing the meta-review for #aaai24, I stumbled upon a new form of parallelism. It wasn't in the paper's concepts but in the review comments, where two reviewers listed identical comments, word for word, with over 200 matching words.
#PeerReview #AIResearch
📢 Call for Papers! International Journal of Computer Vision (IJCV) invites submissions for its special issue on "Generative Models for Content Creation and Manipulation."
🗓️ Manuscript Submission Deadline: February 28, 2024
🔗 Check it out here: https://t.co/S81We2MU7E
Fascinating research by Google reveals the power of large language models (LLMs) like PaLM or GPT in tackling visual tasks using in-context learning. This novel method enables LLMs to perform image generation tasks without requiring any parameter updates. #palm #GPT4 #LLMs