Tech2Wild

Ran NVIDIA Nemotron-3-Ultra-550B fully local across 2 DGX Sparks (188GB split via llama.cpp RPC) 🤯 Findings: it works + reasons — but ~5 tok/s, since RPC is round-trip-bound (dual-node is slower per-token than one; it's a capacity play). But I question bigger≠better: 2-bit 550B barely tops a clean 4-bit ~285B. Can we agree ?

Tech2Wild

@Tech2Wild

about 22 hours ago

@outsource_ Model: unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF, UD-Q2_K_XL (~188 GiB, 6 shards)

158

Tech2Wild

@Tech2Wild

2 days ago

Very Interesting, I could imagine running this on a laptop and having a personal agent on that laptop that is able to help you with whatever task you need locally. Google delivers first on the perfect AGENT and size on a RTX Spark Laptop.

Google

@Google

2 days ago

Today we’re introducing Gemma 4 12B — our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop. It delivers performance nearing our larger Gemma models with a much smaller total memory footprint, while being small enough to run locally with just 16GB of VRAM. It’s open and accessible for everyone to use under a permissive Apache 2.0 license. This is all made possible by our new, unified architecture that removes separate multimodal encoders. Here’s how we did it 🧵

Google's tweet photo. Today we’re introducing Gemma 4 12B — our latest open model that brings advanced agentic reasoning, vision and audio directly to your laptop.

It delivers performance nearing our larger Gemma models with a much smaller total memory footprint, while being small enough to run locally with just 16GB of VRAM. It’s open and accessible for everyone to use under a permissive Apache 2.0 license.

This is all made possible by our new, unified architecture that removes separate multimodal encoders. Here’s how we did it 🧵

245

831K

Tech2Wild

@Tech2Wild

3 days ago

@CaptainHaHaa This is dope man, what model was this ?

Tech2Wild retweeted

ToNYD2WiLD

@ToNYD2WiLD

3 days ago

I honestly had this thought pop up in my head yesterday bro foreal. Like I’ve been diving heavy into AI and where that’s going and the efficiency and pricing of AI in China and how it’s winning people over. BUT now even our athletes like Step Curry. I am pro American but China has a plan and there way to it is to undercut American product.

Tech2Wild

@Tech2Wild

3 days ago

@AI_Homelab Yea I got one more left lol. After that I’m going to have to grab me a new motherboard and rig and begin my workstation lol

Tech2Wild

@Tech2Wild

5 days ago

I just found out my motherboard has a slot for a 3rd GPU. What can 3 x RTX 3090s actually do? I need ideas, someone tell me please!

753

Tech2Wild

@Tech2Wild

3 days ago

Qwen, when?

Tech2Wild

@Tech2Wild

4 days ago

Going to bed now, @Alibaba_Qwen I am hoping to wake up to Qwen 35B A3 and/or 27B tomorrow.

Tech2Wild

@Tech2Wild

4 days ago

So this would do really well in a Reachy 2 Mini I'm thinking.

Ming-Yu Liu

@liu_mingyu

4 days ago

Introducing NVIDIA Cosmos 3 We released NVIDIA Cosmos 3 last night. And today, seeing it take the top spots across 8+ open model leaderboards feels surreal. We spent months working towards this moment. Here’s the breakdown: The Leaderboard Wins World Reasoning 🏆 #1 open model on VANTAGE-Bench for vision AI 🏆 #1 overall on Traffic Anomaly Reasoning (TAR) World Generation 🏆 #1 open model on Artificial Analysis Image-to-Video leaderboard 🏆 #1 open model on Artificial Analysis Text-to-Image leaderboard 🏆 #1 open model on PAI-Bench for physical AI synthetic data generation 🏆 #1 open model on Physics-IQ, which measures accuracy on physical laws 🏆 #1 open model on R-Bench for world generation quality World Action 🏆 #1 on RoboArena for specialized policy 🏆 #1 on RoboLab for action generation But the leaderboards are only part of the story. The real story is why we built Cosmos 3 in the first place. The Problem Training robots and autonomous systems in the real world is painfully hard. Robots need to try the same thing numerous times before they succeed reliably. Self-driving cars need rare edge cases that may never happen naturally. Smart machines need to understand physics, motion, contact, failure, and surprise. And real-world data is slow, expensive, and sometimes dangerous to collect. At some point, the answer cannot just be “collect more data.” You can’t collect your way out of an infinite physical world. You have to generate it. That… was the question behind Cosmos: Can one model understand the physical world deeply enough to reason about it, simulate it, and generate actions inside it? What We Built Cosmos 3 is the first omni-model for physical AI. It can understand and generate across: language · images · video · audio · action sequences It is not just a VLM. Not just a video generator. Not just a robot policy model. It is all of them, in one single model. That matters because physical AI has been fragmented for a long time. Cosmos 3 is our attempt to collapse that fragmentation. Depending on how you configure the inputs and outputs, the same model can act as a vision-language model, a video/world generator, a world simulator, or a world-action model. No separate architecture required. The Architecture Under the hood, Cosmos 3 uses a dual-tower Mixture-of-Transformers architecture. One tower is autoregressive for reasoning. It handles next-token prediction for language and discrete understanding. The other tower is diffusion-based- for generation. It denoises images, video, audio, and action trajectories. Two towers. Dual-stream joint attention. One shared world representation. Each modality gets its own tools: visual encoders, video VAEs, audio VAEs, and action projectors that can map different embodiments into a unified action space. Action is a first-class modality in Cosmos 3. That’s what makes it more than a video model. It doesn’t just predict and generate what the world might look like. It can connect reasoning and world modeling to physically grounded action. Why This Matters One of the most interesting findings from the ablation work is that training action domains together creates positive transfer. That means adding more embodiments does not just add more use cases. It can actually make the model better. This is the heart of why omnimodal training matters. A shared world representation is not just convenient. It can make each individual task stronger. That’s the part that feels like the beginning of something much bigger. The part I’m most excited about is that Cosmos 3 is fully open. Developers get the models, scripts, optimization, inference endpoints, post-training recipes, datasets, and benchmarks. Everything is available under the Linux Foundation’s OpenMDW 1.1 License. You can use Cosmos 3 out of the box. You can use the VLM, world model, or world-action pieces separately. You can post-train it for your own domain, embodiment, or accuracy target. That’s what makes this feel different. Cosmos 3 is not just a model release. It is the foundation for building intelligence for autonomous machines. For me, Cosmos 3 feels like a step toward a world where physical AI development becomes much more scalable and accessible - to a new age of developers and agents. That’s what we built Cosmos 3 for. I cannot wait to see what you build with it. Download Models on Hugging Face https://t.co/LAZoVygeim Customize Models on GitHub https://t.co/ZVQBNdqXDD Read the Tech Blog to Learn More https://t.co/Hn6Op9YeG1

liu_mingyu's tweet photo. Introducing NVIDIA Cosmos 3

We released NVIDIA Cosmos 3 last night.

And today, seeing it take the top spots across 8+ open model leaderboards feels surreal. We spent months working towards this moment.

Here’s the breakdown:

The Leaderboard Wins

World Reasoning
🏆 #1 open model on VANTAGE-Bench for vision AI
🏆 #1 overall on Traffic Anomaly Reasoning (TAR)

World Generation
🏆 #1 open model on Artificial Analysis Image-to-Video leaderboard
🏆 #1 open model on Artificial Analysis Text-to-Image leaderboard
🏆 #1 open model on PAI-Bench for physical AI synthetic data generation
🏆 #1 open model on Physics-IQ, which measures accuracy on physical laws
🏆 #1 open model on R-Bench for world generation quality

World Action
🏆 #1 on RoboArena for specialized policy
🏆 #1 on RoboLab for action generation

But the leaderboards are only part of the story. The real story is why we built Cosmos 3 in the first place.

The Problem

Training robots and autonomous systems in the real world is painfully hard.

Robots need to try the same thing numerous times before they succeed reliably. Self-driving cars need rare edge cases that may never happen naturally. Smart machines need to understand physics, motion, contact, failure, and surprise.

And real-world data is slow, expensive, and sometimes dangerous to collect. At some point, the answer cannot just be “collect more data.”

You can’t collect your way out of an infinite physical world. You have to generate it.

That… was the question behind Cosmos: Can one model understand the physical world deeply enough to reason about it, simulate it, and generate actions inside it?

What We Built

Cosmos 3 is the first omni-model for physical AI. It can understand and generate across: language · images · video · audio · action sequences

It is not just a VLM.

Not just a video generator.

Not just a robot policy model.

It is all of them, in one single model.

That matters because physical AI has been fragmented for a long time. Cosmos 3 is our attempt to collapse that fragmentation.

Depending on how you configure the inputs and outputs, the same model can act as a vision-language model, a video/world generator, a world simulator, or a world-action model.

No separate architecture required.

The Architecture

Under the hood, Cosmos 3 uses a dual-tower Mixture-of-Transformers architecture.

One tower is autoregressive for reasoning. It handles next-token prediction for language and discrete understanding.

The other tower is diffusion-based- for generation. It denoises images, video, audio, and action trajectories.
Two towers. Dual-stream joint attention. One shared world representation.

Each modality gets its own tools: visual encoders, video VAEs, audio VAEs, and action projectors that can map different embodiments into a unified action space.

Action is a first-class modality in Cosmos 3.

That’s what makes it more than a video model. It doesn’t just predict and generate what the world might look like. It can connect reasoning and world modeling to physically grounded action.

Why This Matters

One of the most interesting findings from the ablation work is that training action domains together creates positive transfer.

That means adding more embodiments does not just add more use cases. It can actually make the model better.

This is the heart of why omnimodal training matters.

A shared world representation is not just convenient. It can make each individual task stronger. That’s the part that feels like the beginning of something much bigger.

The part I’m most excited about is that Cosmos 3 is fully open.

Developers get the models, scripts, optimization, inference endpoints, post-training recipes, datasets, and benchmarks.

Everything is available under the Linux Foundation’s OpenMDW 1.1 License.

You can use Cosmos 3 out of the box. You can use the VLM, world model, or world-action pieces separately.

You can post-train it for your own domain, embodiment, or accuracy target.

That’s what makes this feel different.

Cosmos 3 is not just a model release. It is the foundation for building intelligence for autonomous machines.

For me, Cosmos 3 feels like a step toward a world where physical AI development becomes much more scalable and accessible - to a new age of developers and agents.

That’s what we built Cosmos 3 for. I cannot wait to see what you build with it.

Download Models on Hugging Face
https://t.co/LAZoVygeim

Customize Models on GitHub
https://t.co/ZVQBNdqXDD

Read the Tech Blog to Learn More
https://t.co/Hn6Op9YeG1

452

196

63K

Tech2Wild

@Tech2Wild

4 days ago

I'm starting to feel like Minimax is holding out on the weight sizes for next week because they've maintained similar weights around Minimax 2.7, I really hope 👀

Tech2Wild

@Tech2Wild

4 days ago

@sakurayukiai Thanks Sakura I need to check this out. I am still learning LOCAL. Question what are the benefit of running unquant vs Autoround INT4(what I run now). I am guessing smarter answers being #1 right ? One thing I have to juggle there is do I lose currency ? Do I lose speed ?

Tech2Wild

@Tech2Wild

4 days ago

@CanOfWorms777 yes, I been looking to grab one off EBAY and yea it seems like 2 for 1 model and then the last one having something different.

Tech2Wild

@Tech2Wild

4 days ago

@loktar00 that's what im thinking. One I had one I was always swapping back and forth 27b/35B/ and LTX2.3. Now I can just keep 27b going and have LTX 2.3 chilling OR 35B and have a full card and I'd still get 100TPS on that too lol. 27B on dual sparks feels fake bro.

101

Tech2Wild

@Tech2Wild

5 days ago

NVIDIA just announced Nemotron 3 Ultra. 550B parameters, hits 91% on agent productivity, 82% on instruction following, and 95% on 1M token context. They're not just building chips anymore, they're building the entire AI software stack.

Tech2Wild's tweet photo. NVIDIA just announced Nemotron 3 Ultra. 550B parameters, hits 91% on agent productivity, 82% on instruction following, and 95% on 1M token context. They're not just building chips anymore, they're building the entire AI software stack. https://t.co/Kgx1rItf03

110

Tech2Wild

@Tech2Wild

5 days ago

Every CPU ever made was built for humans. The new Vera CPU is built for AGENTS. Jensen: "agents are impatient, they live in nanoseconds, not seconds." The chip's whole job just flipped: serving people to serving AI. The agent era isn't coming, it's the roadmap.

Tech2Wild's tweet photo. Every CPU ever made was built for humans. The new Vera CPU is built for AGENTS.

Jensen: "agents are impatient, they live in nanoseconds, not seconds."

The chip's whole job just flipped: serving people to serving AI. The agent era isn't coming, it's the roadmap. https://t.co/0GhSTyBTDG

112

Tech2Wild

@Tech2Wild

5 days ago

My agent Kai is watching the Nvidia GTC keynote right now via Eyes of Kai. He can see and hear everything as it comes in. The dope part is I can ask him questions about what is happening and he answers while still gathering the live speech to text in the background.

Tech2Wild

@Tech2Wild

5 days ago

@Tono_Ken3 I definitely want to look into a rig like yours or something like I can build into. Something I can invest GPUs in monthly or bi monthly

Tech2Wild

@Tech2Wild

Last Seen Users on Sotwe

Trends for you

Most Popular Users