Cosmos 3 is out! It's our latest family of Omni World Foundation Models for Physical AI. It uses a Mixture-of-Transformers (MoT) architecture to unify a reasoner and a generator tower into a shared omnimodal world model that moves fluidly across text, images, video, audio, and actions.
It is now a leading open-source model across understanding, reasoning, generation, and action benchmarks for Physical AI. Webpage: https://t.co/PSFO1sxim4
Check out Sahil and Mengqi's work!
I'm also at #CVPR2026 -- if you want to talk pre-training, evals, data, world models, or how we built Cosmos 3 (and everything that broke along the way :)), down to chat.
We are presenting WFM-Eval at two @CVPR 2026 workshops in Denver 📍
🗓️ Jun 3, Video World Models
Poster 9:50–10:40 AM, Exhibit Hall A
🗓️ Jun 4, Foundation Models Meet Embodied Agents
Poster 3:55–4:30 PM
Come say hi 👋
Work done with @AmberZhang99 @prithvijitch @judyfhoffman
NVIDIA's Cosmos 3 lands at #1 among open weights models in both Text to Image and Image to Video on the Artificial Analysis Leaderboards!
Cosmos 3 is a family of omnimodal world models for Physical AI from @nvidia, unifying language, image, video, audio and action in a single Mixture-of-Transformers architecture that pairs an autoregressive reasoner with a diffusion generator.
The family comes in four variants: the base Nano (16B: 8B reasoner tower + 8B generator tower) and Super (64B: 32B reasoner tower + 32B generator tower) models, plus Text2Image and Image2Video fine-tunes of Super. The two fine-tuned variants are the versions listed in the Artificial Analysis Arena Leaderboards.
Cosmos3-Super-Text2Image (agentic) runs through an agentic prompt-upsampling harness, and takes the #1 open weights spot in Text to Image, surpassing HiDream-O1-Image-Dev-2604, Alibaba's Qwen Image Max 2512 and Black Forest Labs' FLUX.2 [dev].
Cosmos3-Super-Image2Video takes #1 open weights in Image to Video (No Audio), ahead of Lightricks' LTX-2 and Alibaba's Wan 2.2 A14B.
Cosmos 3 generators take structured JSON prompts rather than plain text, so prompt upsampling is needed to reproduce these results. This upsampling can be handled by an external harness or by the model's own reasoner branch, so it can also run self-contained.
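As a rough illustration of the upsampling step described above, here is a minimal Python sketch. Everything in it is an assumption: the field names (`caption`, `subjects`, `camera`), the `stub_upsampler` function and the `build_structured_prompt` helper are hypothetical and do not reflect the actual Cosmos 3 prompt schema or API. It only shows the shape of the idea: a plain-text prompt goes through a pluggable upsampler (an external harness, or the model's own reasoner branch) and comes out as a JSON string for the generator.

```python
import json
from typing import Callable, Dict

# Hypothetical schema: the real Cosmos 3 JSON prompt fields are not given here.
StructuredPrompt = Dict[str, object]


def stub_upsampler(plain_prompt: str) -> StructuredPrompt:
    """Stand-in for an external harness or the model's reasoner branch."""
    return {
        "caption": plain_prompt.strip(),  # cleaned-up user text
        "subjects": [],                   # placeholder, filled by a real upsampler
        "camera": None,                   # placeholder, filled by a real upsampler
    }


def build_structured_prompt(
    plain_prompt: str,
    upsampler: Callable[[str], StructuredPrompt] = stub_upsampler,
) -> str:
    """Turn plain text into a JSON string via the supplied upsampler."""
    structured = upsampler(plain_prompt)
    if not isinstance(structured, dict) or "caption" not in structured:
        raise ValueError("upsampler must return a dict with a 'caption' field")
    return json.dumps(structured, ensure_ascii=False)


if __name__ == "__main__":
    print(build_structured_prompt("  a robot arm stacking cups "))
```

Passing the upsampler as an argument mirrors the two options mentioned above: the same call site works whether the structured prompt comes from an outside harness or from the model itself.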
Cosmos 3 is fully open under the OpenMDW 1.1 license, shipping with weights, code, curated datasets and fine-tuning recipes available on @huggingface. First-party and third-party APIs are expected over the next few weeks, with pricing to follow.
See the thread below for example generations and a link to try Cosmos 3 in our arena 🧵
Look what we’re cooking! Cosmos 3 is a family of unified omnimodal world models (language, image, video, audio, action), topping multiple benchmarks! Proud to have led Cosmos3-Super-Image2Video, now the #1 open I2V model on Artificial Analysis. Hope it empowers the community!
We're open-sourcing Cosmos 3 today along with a technical report detailing what went into building it. This project pushed us through some genuinely hard problems, and the report tries to capture the depth of that work. It has been a privilege to be able to contribute to different aspects of this project. This was a huge team effort!
Technical Report: https://t.co/8rgwNhvxde
[1/6] Ego-centric World Models
We introduce EgoWM — a video world model that simulates EVE-1X humanoid interactions from a single ego-view image + full-body joint angle trajectories.
Moreover, it effortlessly generalizes to extreme OOD domains, including paintings!
I’m thrilled to share that Cosmos Reason 2 is here, our latest open, high-accuracy reasoning vision-language model for physical AI.
Read our blog to learn more 📖 https://t.co/mmlziCbcbl
Download Cosmos Reason 2 👉 https://t.co/oV2KWwkOVf
🤔Want a principled way to RL your diffusion model?
Check out Data-regularized Reinforcement Learning (DDRL)! Post-train @nvidia #Cosmos World Foundation models with a million GPU hours! 🤯
Novel formulation ➡️ Theoretically integrates SFT into RL ➡️ Robust to Reward Hacking 🛑
Details: https://t.co/1A9q8ho2xb
#DDRL #Diffusion #RL #NVIDIA #Cosmos
🚀Try out rCM—the most advanced diffusion distillation!
✅First to scale up sCM/MeanFlow to 10B+ video models
✅Open-sourced FlashAttention-2 JVP kernel & FSDP/CP support
✅High-quality, diverse videos in 2~4 steps
Paper: https://t.co/xZZK25oIrJ
Code: https://t.co/aPAo1MO0JQ
[ICCV 25] Refer Everything Model (REM)
(1/6) We leverage Text-to-Video Generation models to zero-shot segment any concept in a video using text. REM generalises to dynamic concepts like smoke, light beams and more without ever having seen segmentation masks for these entities.
(1/n)
Can pretrained video diffusion models be prompted to track pixels — without any retraining?
We introduce Point-Prompting, a zero-shot point tracking method that simply prompts video models to visually mark and propagate points across time.
🌐 https://t.co/ZhNTp7e8zt
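To make the "visually mark and propagate" idea concrete, here is a toy Python sketch of the two bookkeeping steps around the video model: painting a marker on the query point and reading its position back from a later frame. It is an illustration under loud assumptions, not the authors' method: the red marker colour, the square marker shape, the colour-tolerance matching and the function names `mark_point` and `locate_marker` are all hypothetical, and the video model that would propagate the marker between frames is left out entirely.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # frame[row][col] -> (R, G, B)

MARKER: Pixel = (255, 0, 0)  # assumed marker colour, chosen for illustration


def mark_point(frame: Frame, row: int, col: int, radius: int = 1) -> Frame:
    """Return a copy of the frame with a square marker centred on (row, col)."""
    out = [list(r) for r in frame]
    for r in range(max(0, row - radius), min(len(out), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(out[0]), col + radius + 1)):
            out[r][c] = MARKER
    return out


def locate_marker(frame: Frame, tol: int = 30) -> Optional[Tuple[float, float]]:
    """Centroid (row, col) of pixels within tol of the marker colour, else None."""
    hits = [
        (r, c)
        for r, row_px in enumerate(frame)
        for c, px in enumerate(row_px)
        if all(abs(a - b) <= tol for a, b in zip(px, MARKER))
    ]
    if not hits:
        return None
    n = len(hits)
    return (sum(r for r, _ in hits) / n, sum(c for _, c in hits) / n)


if __name__ == "__main__":
    blank = [[(0, 0, 0)] * 8 for _ in range(8)]
    first = mark_point(blank, 3, 4)
    # In a real pipeline a video model would generate later frames from `first`;
    # here we just read the marker back from the same frame.
    print(locate_marker(first))
```

In a full pipeline, `locate_marker` would run on each generated frame to produce the track, and returning `None` gives a natural signal for frames where the marker is not visible.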
Vision tokenizers are stuck in 2020 🤔 while language models revolutionized AI 🚀
Language: One tokenizer for everything
Vision: Fragmented across modalities & tasks
Introducing AToken: The first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND understanding in a single transformer framework.
Paper: https://t.co/wiN4WJDV6I | Code & models coming soon 🧵
Catch our #CVPR2025 poster today!
🖼️ “A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation”
📍 Exhibit Hall D, Poster #230
🕓 4:00–6:00 PM
We explore how LLMs perform as text encoders for image generation—with some interesting findings!
🔗 Webpage: https://t.co/lUkWuHdHuk
📄 Paper: https://t.co/mOOCQfXZ2P
Amazing work by @Andrewzzzwang and @Songwei_Ge during their internship at NVIDIA
The WorldModelBench workshop is happening tomorrow (June 12th) at #CVPR2025! We have an exciting series of talks, so do attend!
Place: Room 108
Time: Morning Session
#NVIDIAResearch
Join us at the WorldModelBench workshop at #CVPR2025 where we'll tackle systematic evaluation of World Models! Focus: benchmarks, metrics, downstream tasks, and safety. Submit papers now: https://t.co/1Vhn814Ht6