Introducing NVIDIA Cosmos 3
We released NVIDIA Cosmos 3 last night.
And today, seeing it take the top spots across 8+ open model leaderboards feels surreal. We spent months working towards this moment.
Here’s the breakdown:
The Leaderboard Wins
World Reasoning
🏆 #1 open model on VANTAGE-Bench for vision AI
🏆 #1 overall on Traffic Anomaly Reasoning (TAR)
World Generation
🏆 #1 open model on Artificial Analysis Image-to-Video leaderboard
🏆 #1 open model on Artificial Analysis Text-to-Image leaderboard
🏆 #1 open model on PAI-Bench for physical AI synthetic data generation
🏆 #1 open model on Physics-IQ, which measures accuracy on physical laws
🏆 #1 open model on R-Bench for world generation quality
World Action
🏆 #1 on RoboArena for specialized policy
🏆 #1 on RoboLab for action generation
But the leaderboards are only part of the story. The real story is why we built Cosmos 3 in the first place.
The Problem
Training robots and autonomous systems in the real world is painfully hard.
Robots need to try the same thing numerous times before they succeed reliably. Self-driving cars need rare edge cases that may never happen naturally. Smart machines need to understand physics, motion, contact, failure, and surprise.
And real-world data is slow, expensive, and sometimes dangerous to collect. At some point, the answer cannot just be “collect more data.”
You can’t collect your way out of an infinite physical world. You have to generate it.
That… was the question behind Cosmos: Can one model understand the physical world deeply enough to reason about it, simulate it, and generate actions inside it?
What We Built
Cosmos 3 is the first omni-model for physical AI. It can understand and generate across: language · images · video · audio · action sequences
It is not just a VLM.
Not just a video generator.
Not just a robot policy model.
It is all of them, in one single model.
That matters because physical AI has been fragmented for a long time. Cosmos 3 is our attempt to collapse that fragmentation.
Depending on how you configure the inputs and outputs, the same model can act as a vision-language model, a video/world generator, a world simulator, or a world-action model.
No separate architecture required.
The Architecture
Under the hood, Cosmos 3 uses a dual-tower Mixture-of-Transformers architecture.
One tower is autoregressive for reasoning. It handles next-token prediction for language and discrete understanding.
The other tower is diffusion-based, for generation. It denoises images, video, audio, and action trajectories.
Two towers. Dual-stream joint attention. One shared world representation.
Each modality gets its own tools: visual encoders, video VAEs, audio VAEs, and action projectors that can map different embodiments into a unified action space.
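To make the two-tower idea concrete, here is a minimal pure-Python sketch of dual-stream joint attention: each stream is projected by its own tower's weights, then attention runs over the concatenated sequence so both streams see each other. This is an illustration only, not NVIDIA's implementation. The function names, the tiny dimensions, the random weights, and the absence of any attention mask are all assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_tower(dim, rng):
    """Hypothetical per-tower Q/K/V projection weights."""
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}


def joint_attention(reason_tokens, gen_tokens, reason_tower, gen_tower):
    """Project each stream with its own tower, then attend over both streams at once."""

    def project(name):
        # List concatenation stacks reasoner rows on top of generator rows.
        return matmul(reason_tokens, reason_tower[name]) + matmul(gen_tokens, gen_tower[name])

    q, k, v = project("q"), project("k"), project("v")
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(weights, v)
    split = len(reason_tokens)
    return mixed[:split], mixed[split:], weights


rng = random.Random(0)
dim = 4
reason_tokens = random_matrix(3, dim, rng)  # stand-in for three text tokens
gen_tokens = random_matrix(2, dim, rng)     # stand-in for two noisy latents
reason_out, gen_out, weights = joint_attention(
    reason_tokens, gen_tokens, make_tower(dim, rng), make_tower(dim, rng)
)
print(len(reason_out), len(gen_out), len(weights[0]))  # 3 2 5
```

Every token's attention row spans all five positions, which is the sense in which the two towers share one representation while keeping separate weights.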
Action is a first-class modality in Cosmos 3.
That’s what makes it more than a video model. It doesn’t just predict and generate what the world might look like. It can connect reasoning and world modeling to physically grounded action.
Why This Matters
One of the most interesting findings from the ablation work is that training action domains together creates positive transfer.
That means adding more embodiments does not just add more use cases. It can actually make the model better.
This is the heart of why omnimodal training matters.
A shared world representation is not just convenient. It can make each individual task stronger. That’s the part that feels like the beginning of something much bigger.
The part I’m most excited about is that Cosmos 3 is fully open.
Developers get the models, scripts, optimization, inference endpoints, post-training recipes, datasets, and benchmarks.
Everything is available under the Linux Foundation’s OpenMDW 1.1 License.
You can use Cosmos 3 out of the box. You can use the VLM, world model, or world-action pieces separately.
You can post-train it for your own domain, embodiment, or accuracy target.
That’s what makes this feel different.
Cosmos 3 is not just a model release. It is the foundation for building intelligence for autonomous machines.
For me, Cosmos 3 feels like a step toward a world where physical AI development becomes much more scalable and accessible - to a new age of developers and agents.
That’s what we built Cosmos 3 for. I cannot wait to see what you build with it.
Download Models on Hugging Face
https://t.co/LAZoVygeim
Customize Models on GitHub
https://t.co/ZVQBNdqXDD
Read the Tech Blog to Learn More
https://t.co/Hn6Op9YeG1
NVIDIA's Cosmos 3 lands at #1 among open weights models in both Text to Image and Image to Video on the Artificial Analysis Leaderboards!
Cosmos 3 is a family of omnimodal world models for Physical AI from @nvidia, unifying language, image, video, audio and action in a single Mixture-of-Transformers architecture that pairs an autoregressive reasoner with a diffusion generator.
The family comes in four variants: base Nano (16B: 8B reasoner tower + 8B generator tower) and Super (64B: 32B reasoner tower + 32B generator tower) models, with the Super model also having Text2Image and Image2Video fine-tuned variants, which are the versions listed in the Artificial Analysis Arena Leaderboards.
Cosmos3-Super-Text2Image (agentic) runs through an agentic prompt-upsampling harness, and takes the #1 open weights spot in Text to Image, surpassing HiDream-O1-Image-Dev-2604, Alibaba's Qwen Image Max 2512 and Black Forest Labs' FLUX.2 [dev].
Cosmos3-Super-Image2Video takes #1 open weights in Image to Video (No Audio), ahead of Lightricks' LTX-2, and Alibaba's Wan 2.2 A14B.
Cosmos 3 generators take structured JSON prompts rather than plain text, so prompt upsampling is needed to reproduce these results. This upsampling can be handled by an external harness or by the model's own reasoner branch, so it can also run self-contained.
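As a rough illustration of what "structured JSON rather than plain text" means in practice, the sketch below wraps a plain-text idea in a JSON object. The schema is hypothetical: the post does not publish the real field names, so `subject`, `camera`, `lighting`, and `aspect_ratio` are invented for illustration, and the fixed placeholder values stand in for details that a real upsampler (an external harness or the reasoner branch) would infer.

```python
import json


def upsample_prompt(plain_text, aspect_ratio="16:9"):
    """Wrap a plain-text idea in a structured prompt (hypothetical schema)."""
    structured = {
        "subject": plain_text.strip(),
        "camera": "static wide shot",   # placeholder a real upsampler would infer
        "lighting": "soft daylight",    # placeholder a real upsampler would infer
        "aspect_ratio": aspect_ratio,
    }
    return json.dumps(structured, indent=2)


prompt_json = upsample_prompt("  a forklift stacking pallets in a warehouse ")
print(prompt_json)
```

The point is only the shape of the hand-off: free text goes in, a machine-readable object comes out, and the generator consumes the object.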
Cosmos 3 is fully open under the OpenMDW 1.1 license, shipping with weights, code, curated datasets and fine-tuning recipes available on @huggingface. First-party and third-party APIs are expected over the next few weeks, with pricing to follow.
See the thread below for example generations and a link to try Cosmos 3 in our arena 🧵
If you've ever seen someone tweet some cool shader and thought "I don't really even know what a shader is and at this point I'm too afraid to ask" - I've written something just for you.
https://t.co/0ez5xz5vCP
Scott Adams, facing death, shows us how to live.
Someone recommended “How to Fail at Almost Everything and Still Win Big” by Scott Adams. I had burned out on mainstream books, but picked it up, and was hooked. He had put into words a way of living, similar to one I had found, except his approach was systemic and analytical. Better than my own slapdash notes. Outside of religious texts, Adams was and is as close to a “guide to life,” as you’ll ever find. And even if you’re religious, you still live in this world, and would be wise to learn how to navigate it.
Scott is closing in on the end of his life, and even now he is creating new beginnings.
I’d better write this now, I won’t be able to when it’s too late.
After losing Charlie Kirk, a lot of us are wondering how we can possibly write another obituary. While there’s much to complain about with the internet and social media, those mediums expanded the sizes of our communities, our influences, and indeed our families. Too often we find new ways to hate people, instead of finding new people to love.
Scott Adams comes up in conversation at every social event I host. “How is Scott Adams doing? Will he make it?” We all talk about streams we watched and lessons learned. It’s a memorial except he’s still alive. Scott would love to hear that, which is why I have said so repeatedly. I’ve lost too many people, via death or fallings-out, to leave feelings unexpressed.
He’s been a surrogate father figure and mentor to millions of people.
Scott Adams is not liked, he is loved.
People don’t “like” Scott Adams, they aren’t “a fan of his.” They love this man. And I do as well. I’m still living in denial of his fate. We all are.
We’d been making a film about the meaning of life, and while Scott Adams had been in both of our other films, we hadn’t booked him for Meaning yet. Then we found out he was going to take the ride of assisted suicide. Foolishly, we had assumed he’d always be around. Nobody ever dies, right? Your dad will be there to take your call the next time you phone home. Your friends aren’t going anywhere. That’s how we too often live. We could book Scott later.
We reached out and he graciously agreed to be interviewed. We all knew it was going to be our last interview together. Scott and I are both efficient with our time. When a moment is over, it’s time to go do something else. Obligations call. The crew pushed this one as long as we could.
After the interview wrapped up and the gear was packed and it was time to go, there was an awkward pause. I broke it.
“Scott, we love you.” He said thank you. “No, Scott, we love you, I mean it, we all do. We love you.”
None of us broke down crying, not that there would have been any shame in that, but we no doubt all soon will.
Well then, what is the lesson of Scott Adams?
On a practical level, the lesson of Scott Adams is the power of showing up. Nobody works harder and on a more regular schedule. You can set your clock to Scott’s show. Too many of us wait for the muse of inspiration or the jolt of information to force us into action. Work every day, maybe in obscurity and without tangible benefits for years. Eventually you’ll hit your mark and go beyond.
Scott plugged away with his streams from a small account (after a huge career via Dilbert) and soon became must-watch, and then transcended his role to become something much more.
On a spiritual level, we might ask, why do we love Scott? It’s not because he’s so smart (he is). There is no shortage of intelligent, clever, Machiavellian, and rich people with podcasts. When one of them dies, what is lost? All of that Ego and desire for adoration, and does anybody even care? When those people fall while living, who will be there?
Scott is loved because he’s devoted his life to service to humanity. “What is the meaning of life?” is the question we ask every interviewee, and Scott’s answer was, “Be useful to humanity.”
Despite pain, sickness, and inevitable death, Scott is doing his daily streams, serving his country and all of humankind until his end.
He’s a light to the world and a mirror for all of us.
What exactly are we doing with the gift of life given to us by God? (Scott believes in the Simulation, but I believe God evens this all out in the Judgment.) Are we doing enough for others? Are we doing anything for others?
Like everyone else, I’m capable of throwing myself a pity party. Sometimes when life is going too well, and I don’t have real problems, I invent some. That’s where the Ego brings you, recursively worshipping itself, and when that fails, tormenting itself, as each path leads to its own attention.
May all of us live more like Scott Adams, and may God bless his immortal soul when he passes.
P.S. I ran this article through Grok for typos. The original version had “immoral” soul where I meant it to read “immortal.” I think Scott would have had a great laugh had that typo been left in.
Gemini knows your location and the current date, so you can ask Gemini to fill in both by itself, e.g.
----
City name: {get my location from my profile}
Date: {get current date}
----
------ full prompt -------
Present a clear, 45° top-down view of a vertical (9:16) isometric miniature 3D cartoon scene, highlighting iconic landmarks centered in the composition to showcase precise and delicate modeling.
The scene features soft, refined textures with realistic PBR materials and gentle, lifelike lighting and shadow effects. Weather elements are creatively integrated into the urban architecture, establishing a dynamic interaction between the city's landscape and atmospheric conditions, creating an immersive weather ambiance.
Use a clean, unified composition with minimalistic aesthetics and a soft, solid-colored background that highlights the main content. The overall visual style is fresh and soothing.
Display a prominent weather icon at the top-center, with the date (x-small text) and temperature range (medium text) beneath it. The city name (large text) is positioned directly above the weather icon. The weather information has no background and can subtly overlap with the buildings.
The text should match the input city's native language.
Please retrieve current weather conditions for the specified city before rendering.
City name: {get my location from my profile}
Date: {get current date}
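If you send this prompt somewhere that cannot look up your location or the date on its own, you can fill the two slots yourself before sending. A small sketch, assuming the slot strings appear exactly as written in the prompt; the function name, the example city, and the ISO date format are my own choices, not part of the original prompt.

```python
from datetime import date

CITY_SLOT = "{get my location from my profile}"
DATE_SLOT = "{get current date}"


def fill_prompt(template, city, today=None):
    """Replace the two slots with a concrete city and date (ISO format assumed)."""
    today = today or date.today()
    return template.replace(CITY_SLOT, city).replace(DATE_SLOT, today.isoformat())


template = "City name: {get my location from my profile}\nDate: {get current date}"
filled = fill_prompt(template, "Lisbon", date(2025, 1, 2))
print(filled)
# City name: Lisbon
# Date: 2025-01-02
```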
Have you heard what we’ve been cooking? 🧑‍🍳
We’re serving up step-by-step recipes for post-training, inference, data curation, and more in our Cosmos Cookbook.
📖 Guided video augmentations for realistic transformations
📖 Domain adaptation and synthetic data augmentation for autonomous vehicle research
📖 Sim2Real data augmentation for robotics navigation
Read our blog to learn more ➡️ https://t.co/4GTSH4E5d9
Start cooking ➡️ https://t.co/CVJYppJdgy
After a year of teamwork, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀
Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video.
In pursuit of minimal modeling, DA3 reveals two key insights:
💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture.
✨ A single depth-ray representation is enough. No complex 3D tasks.
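One way to read the depth-ray idea: if each pixel carries a depth value and a ray (an origin and a direction), a 3D point follows from simple arithmetic, point = origin + depth × direction. The sketch below assumes that reading; the function names are hypothetical, and whether the direction is unit-length or scaled so that depth is measured along the camera axis is a convention this sketch does not settle.

```python
def unproject(origin, direction, depth):
    """Recover a 3D point from one pixel's ray and depth: origin + depth * direction."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def unproject_pixels(origins, directions, depths):
    """Apply unproject to flat, equally long lists of per-pixel rays and depths."""
    return [unproject(o, d, z) for o, d, z in zip(origins, directions, depths)]


points = unproject_pixels(
    origins=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    directions=[(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    depths=[2.0, 3.0],
)
print(points)  # [(0.0, 0.0, 2.0), (1.0, 3.0, 0.0)]
```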
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series.
The core team members, aside from me: @HaotongLin, Sili Chen, Jun Hao Liew, @donydchen.
👇(1/n)
#DepthAnything3