Huge thanks to @_akhaliq for sharing our work!
We introduce TUNA, a unified multimodal model that handles both image/video understanding and generation/editing. The key is a unified, end-to-end learned visual representation.
Excited to share STARFlow2 from Apple MLR :
🥨Bridging Language Models and Normalizing Flows for Unified Multimodal Generation.
One model to understand, reason, and generate continuous images with a single unified autoregressive mechanism?
Paper: https://t.co/IA1pJ5AtOX
1/9
🚀 We are excited to announce the release of AnyFlow, the first any-step video diffusion on-policy distillation (OPD) framework. By leveraging Flow Map distillation, AnyFlow significantly enhances model inference efficiency by reducing sample steps. (Code, models, and demos are now open-source!)
Key Highlights:
⚡ Any-Step Generation: Unlike traditional distilled models tied to fixed step budgets, AnyFlow enables a single model to adapt to arbitrary inference budgets. It achieves high-quality few-step generation while providing stable improvements as more sampling steps are added.
🔀 Multiple Architectures: AnyFlow supports any-step distillation for both causal and bidirectional video diffusion models.
🎬 Multiple Tasks: AnyFlow supports Text-to-Video, Image-to-Video, and Video-to-Video generation within one causal video diffusion model.
📈 Scalable Performance: AnyFlow is validated from 1.3B up to 14B parameters.
📄 Paper: https://t.co/Qqik8l29oB
💻 Code: https://t.co/KOMv9RtuWu
🎨 Pre-trained Models: https://t.co/Br1MNllUu8
🎬 Demo: https://t.co/hxbl56lPFU
New paper: AsymFlow🔥
JiT x0-prediction is not enough for pixel generation. Better keep velocity in a low-rank subspace:
- 1.57 FID on ImageNet (best pixel flow model)
- Finetunes FLUX.2 klein into pixel space, beats the original on HPSv3/DPG/GenEval (#1 overall on HPSv3)
1/7
GOOGLE 🔥: An upcoming Gemini Omni video model from Google is expected to be much more advanced in video editing, capable of completing tasks like removing watermarks, replacing objects in the video, and more.
It is also likely that Google will release 2 versions of this model, including a Pro variant.
And I assume what we see isn't Pro?
Anime sample 👀
Most current models that handle both image understanding and image generation rely on separate pretrained components: a vision encoder (a network like CLIP that turns images into feature vectors) for understanding, and a VAE — a compression network mapping images into a smaller latent representation — for generation.
The authors of this paper remove both and feed raw pixel patches directly into a single transformer trained end to end.
The results show that pretrained vision encoders are not necessary for multimodal modeling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
Read with an AI tutor: https://t.co/FCC0qeeHWV
PDF: https://t.co/M2866qdBNS
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
https://t.co/AFJZ5kH7Ku
Great thread about multimodal models.
Just because a single Transformer can do it all, that doesn't mean doing things this way makes the most sense economically. (Although it probably will, eventually!)
Cool paper from Meta suggesting that future MLLMs will be Native Multimodal Models (NMM), hence no vision encoders anymore
But I disagree
I actually think we'll go in the other direction (what? more encoders? yes! read on...)
All you need to know about the future of MLLMs 🧵
Introducing SubQ - a major breakthrough in LLM intelligence.
It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA),
And the first frontier model with a 12 million token context window which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention).
Only a small fraction actually matter.
@subquadratic finds and focuses only on the ones that do.
That's nearly 1,000x less compute and a new way for LLMs to scale.
Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space.
Now it is 0.75, and can be even lower.
Many wonder how.
I thought it might end as a small FID prank: simple and deliberate.
It started with one question: can FID be optimized directly, and what does it reveal?
Introducing FD-loss.
1/🚀 Excited to announce Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation!
We built an omni model utilizing direct patch embedding layers for raw image inputs and achieves SOTA in multimodal understanding AND generation.
Paper: https://t.co/rk0tIB4tbt
Code: https://t.co/OSAos8k33x
Thanks to all the co-authors! @__Johanan, @wmren993, @xiaoke_shawn_h, @ShoufaChen, @TianhongLi6, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, @WenhuChen, Ping Luo, @LukeZettlemoyer!
Nicely done. I would also shamelessly recommend my collection on test time training, which has overlap with online learning topic. https://t.co/DxiqczlDyh