Johnny Núñez

@johnnync13

Physical AI and Robotics at @NVIDIA

Barcelona

Joined August 2015

1.8K Following

572 Followers

6.5K Posts

Pinned Tweet

Johnny Núñez

@johnnync13

over 1 year ago

@sama @OpenAI My 93-year-old grandfather discovers ChatGPT Voice Mode for the first time, and the results are nothing short of amazing. He loved it, and it made him so happy. Affective computing and AI like this will transform life for older adults. #MerryChristmas

Johnny Núñez

@johnnync13

2 days ago

Newton has arrived in Isaac Sim 6.0. Full features are coming.

Prox Industries株式会社【公式】

@prox_industries

2 days ago

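[Newton × VBD: Simulating robotic cloth manipulation] We built a cloth simulation environment on Isaac Lab using Newton's VBD (Vertex Block Descent) and implemented a system for teleoperating a robot in the virtual space with a VR device. In the video, we operate a Franka robot arm to touch, push, and lift the cloth. Result: although some penetration between the robot and the cloth was observed, we confirmed more robust behavior for "self-collision" (the cloth sinking into itself) and "extreme deformation", both of which tend to be problems in conventional physics simulation. 💡 About Newton / VBD: Newton is an open-source physics engine jointly developed by NVIDIA, Google DeepMind, and Disney Research. It features GPU acceleration, high extensibility, and integration with learning frameworks, and it is expected to handle the complex physical phenomena that matter for robot learning, such as soft bodies, contact, and deformation. VBD is a method for stably handling deformable bodies such as cloth and soft objects, and it is expected to improve computational robustness even in scenes with self-collision and large deformation. In Sim2Real, the difference between simulation and the real robot (the reality gap) is a challenge. Especially for tasks involving objects such as cloth and clothing, whose shape is not fixed and whose state changes greatly through contact and deformation, it is important to bring more realistic physics into training and evaluation environments. At Prox Industries, we have worked on Sim2Real implementations across a variety of environments, tasks, and real robots. We will continue integrating Newton into our technology and bringing training and evaluation environments for a broader range of Physical AI, including soft-object manipulation and tasks involving complex contact, closer to reality. #PhysicalAI #Robotics #Sim2Real #Newton #IsaacLab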

188

102

12K

335

johnnync13 retweeted

Zhengyi “Zen” Luo

@zhengyiluo

3 days ago

Wanna train humanoids to do useful things? Data comes first! With GRAIL, we have unlimited data potential at our hands!

Johnny Núñez

@johnnync13

2 days ago

@Teknium Congrats to the whole team!

3 days ago

@passionvirus @ROBOTIS And today we are launching Isaac Sim 6.0 GA with Newton backend support 🔥

115

johnnync13 retweeted

Hamid Eghbalzadeh @heghbalz

4 days ago

Excited to share Cosmos 3: an omnimodal world model for Physical AI that connects reasoning, generation, and simulation across text, image, video, audio, and action. https://t.co/Ksnjo3kRm3

924

Johnny Núñez

@johnnync13

4 days ago

@yinghui_he_ Welcome to the NVIDIA family!

135

johnnync13 retweeted

driss guessous @drisspg

4 days ago

I am trying to make ideogram usable on my spark. Problem 1: https://t.co/q2LRrDOcFJ Problem 2: Bitsandbytes is unbelievably slow

johnnync13 retweeted

Ming-Yu Liu

@liu_mingyu

6 days ago

Introducing NVIDIA Cosmos 3

We released NVIDIA Cosmos 3 last night.

And today, seeing it take the top spots across 8+ open model leaderboards feels surreal. We spent months working towards this moment.

Here’s the breakdown:

The Leaderboard Wins

World Reasoning
🏆 #1 open model on VANTAGE-Bench for vision AI
🏆 #1 overall on Traffic Anomaly Reasoning (TAR)

World Generation
🏆 #1 open model on Artificial Analysis Image-to-Video leaderboard
🏆 #1 open model on Artificial Analysis Text-to-Image leaderboard
🏆 #1 open model on PAI-Bench for physical AI synthetic data generation
🏆 #1 open model on Physics-IQ, which measures accuracy on physical laws
🏆 #1 open model on R-Bench for world generation quality

World Action
🏆 #1 on RoboArena for specialized policy
🏆 #1 on RoboLab for action generation

But the leaderboards are only part of the story. The real story is why we built Cosmos 3 in the first place.

The Problem

Training robots and autonomous systems in the real world is painfully hard.

Robots need to try the same thing numerous times before they succeed reliably. Self-driving cars need rare edge cases that may never happen naturally. Smart machines need to understand physics, motion, contact, failure, and surprise.

And real-world data is slow, expensive, and sometimes dangerous to collect. At some point, the answer cannot just be “collect more data.”

You can’t collect your way out of an infinite physical world. You have to generate it.

That… was the question behind Cosmos: Can one model understand the physical world deeply enough to reason about it, simulate it, and generate actions inside it?

What We Built

Cosmos 3 is the first omni-model for physical AI. It can understand and generate across: language · images · video · audio · action sequences

It is not just a VLM.

Not just a video generator.

Not just a robot policy model.

It is all of them, in one single model.

That matters because physical AI has been fragmented for a long time. Cosmos 3 is our attempt to collapse that fragmentation.

Depending on how you configure the inputs and outputs, the same model can act as a vision-language model, a video/world generator, a world simulator, or a world-action model.

No separate architecture required.

The Architecture

Under the hood, Cosmos 3 uses a dual-tower Mixture-of-Transformers architecture.

One tower is autoregressive for reasoning. It handles next-token prediction for language and discrete understanding.

The other tower is diffusion-based, for generation. It denoises images, video, audio, and action trajectories.

Two towers. Dual-stream joint attention. One shared world representation.

Each modality gets its own tools: visual encoders, video VAEs, audio VAEs, and action projectors that can map different embodiments into a unified action space.

Action is a first-class modality in Cosmos 3.

That’s what makes it more than a video model. It doesn’t just predict and generate what the world might look like. It can connect reasoning and world modeling to physically grounded action.
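
Editor's sketch: the "dual-stream joint attention" idea above can be illustrated with a toy example. The tweet does not give implementation details, so everything below is an assumption in the general Mixture-of-Transformers style: each tower projects its own tokens with its own weights, then all tokens attend over one shared sequence. The names `joint_attention`, `ar_w`, and `diff_w` are hypothetical, the weights are random, and causal masking for the autoregressive tower is left out.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(ar_tokens, diff_tokens, ar_w, diff_w):
    """Toy joint attention over two token streams (an assumed design).

    ar_w and diff_w are dicts holding separate 'q', 'k', 'v' matrices,
    one set per tower. Attention itself runs over the concatenation.
    """
    projected = []
    # Tower-specific projections: each stream uses only its own weights.
    for tokens, w in ((ar_tokens, ar_w), (diff_tokens, diff_w)):
        for x in tokens:
            projected.append(
                (matvec(w["q"], x), matvec(w["k"], x), matvec(w["v"], x))
            )
    d = len(projected[0][0])
    outputs = []
    # Shared attention: every query scores against keys from BOTH streams.
    for q, _, _ in projected:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for _, k, _ in projected
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(wt * v[i] for wt, (_, _, v) in zip(weights, projected))
             for i in range(d)]
        )
    n_ar = len(ar_tokens)
    # Split the result back into the two streams.
    return outputs[:n_ar], outputs[n_ar:]


rng = random.Random(0)
d = 4
ar_w = {name: random_matrix(d, d, rng) for name in ("q", "k", "v")}
diff_w = {name: random_matrix(d, d, rng) for name in ("q", "k", "v")}
ar_tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
diff_tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(2)]
ar_out, diff_out = joint_attention(ar_tokens, diff_tokens, ar_w, diff_w)
```

In this toy version, changing the diffusion-side tokens also changes the autoregressive-side outputs, which is the sense in which the two streams share one representation.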

Why This Matters

One of the most interesting findings from the ablation work is that training action domains together creates positive transfer.

That means adding more embodiments does not just add more use cases. It can actually make the model better.

This is the heart of why omnimodal training matters.

A shared world representation is not just convenient. It can make each individual task stronger. That’s the part that feels like the beginning of something much bigger.

The part I’m most excited about is that Cosmos 3 is fully open.

Developers get the models, scripts, optimization, inference endpoints, post-training recipes, datasets, and benchmarks.

Everything is available under the Linux Foundation’s OpenMDW 1.1 License.

You can use Cosmos 3 out of the box. You can use the VLM, world model, or world-action pieces separately.

You can post-train it for your own domain, embodiment, or accuracy target.

That’s what makes this feel different.

Cosmos 3 is not just a model release. It is the foundation for building intelligence for autonomous machines.

For me, Cosmos 3 feels like a step toward a world where physical AI development becomes much more scalable and accessible - to a new age of developers and agents.

That’s what we built Cosmos 3 for. I cannot wait to see what you build with it.

Download Models on Hugging Face
https://t.co/LAZoVygeim

Customize Models on GitHub
https://t.co/ZVQBNdqXDD

Read the Tech Blog to Learn More
https://t.co/Hn6Op9YeG1

453

196

64K

johnnync13 retweeted

LeRobot

@LeRobotHF

6 days ago

🤖 Another zero-shot reward model is now in LeRobot: ROBOMETER. A general-purpose, zero-shot video-language reward model from @UofSC, @UT_Dallas, @MIT, @UW, @allen_ai, and @nvidia that predicts frame-level task progress. Trained on 1M+ trajectories from 21 robot embodiments, generalizes zero-shot to unseen tasks, scenes, and robots. 2.4–4.5x better downstream success rates across online RL, offline RL, data filtering, failure detection, and data retrieval for IL. Project: https://t.co/rkKUcYamYT Paper: https://t.co/gIIwNKdnzv

275

165

32K

Johnny Núñez

@johnnync13

7 days ago

@liu_mingyu @lemonaddie0909 Congrats, team! It’s incredible to see how Cosmos has evolved

johnnync13 retweeted

Ming-Yu Liu

@liu_mingyu

7 days ago

Cosmos 3 is a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. It has incredible capabilities and is ranked as the number one open-source Text2Image and Image2Video model by Artificial Analysis, and as the number one robot policy model by RoboLab and RoboArena. Try it out. model: https://t.co/LAZoVygeim code: https://t.co/ZVQBNdqXDD website: https://t.co/lC9KfkAWcj paper: https://t.co/mUgQ8gqnCb

201

23K

johnnync13 retweeted

NVIDIA AI

@NVIDIAAI

7 days ago

Introducing Cosmos 3: Our latest frontier model for Physical AI Cosmos 3 is the world’s first fully open omnimodel with native vision reasoning, world and action generation. Today we’re releasing Super (32B) and Nano (8B) variants.

406

405K

johnnync13 retweeted

NVIDIA Newsroom

@nvidianewsroom

7 days ago

Vera Rubin is in full production. #NVIDIAGTC

123

146K

johnnync13 retweeted

NVIDIA AI

@NVIDIAAI

10 days ago

We're adopting the Linux Foundation’s OpenMDW framework across our open model families. This helps make open model licensing simpler and more consistent at scale. A single legal framework across models, code, documentation, and data helps reduce friction for developers and enterprises building with open source.

701

159

213K

johnnync13 retweeted

NVIDIA GeForce

@NVIDIAGeForce

9 days ago

A new era of PC. 25.0528, 121.5990

15K

572

johnnync13 retweeted

NVIDIA

@nvidia

9 days ago

A new era of PC. 25.0528, 121.5990

29K

12M

Johnny Núñez

@johnnync13

11 days ago

@zhyncs42 Congrats! 💚

200

johnnync13 retweeted

hardmaru

@hardmaru

12 days ago

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

154

642

738K

johnnync13 retweeted

Ali Hatamizadeh

@ahatamiz1

17 days ago

Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here's the idea behind it:

Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it.

Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.

Gated DeltaNet-2 decouples them.

✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton
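
Editor's sketch: the erase/write split described above can be illustrated with a toy delta-rule update. This is not the paper's equation. The classic delta rule, `S' = S - beta * (S k - v) k^T`, is standard; the decoupled form below, with a per-key-channel `erase` vector and a per-value-channel `write` vector, is one assumed way to realise the idea. All function names are hypothetical, and the decay term that KDA and Gated DeltaNet use is omitted, so scalar gates here reduce to the plain delta rule rather than to KDA.

```python
def read_state(S, k):
    """Read from the state: returns S @ k (one entry per value channel)."""
    return [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]


def classic_delta_step(S, k, v, beta):
    """Standard delta rule with a single scalar gate for erase and write."""
    read = read_state(S, k)
    return [
        [S[i][j] - beta * (read[i] - v[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


def gated_delta_step(S, k, v, erase, write):
    """Assumed decoupled update (illustrative, not the paper's formula).

    erase: one gate per key channel, scales what is removed.
    write: one gate per value channel, scales what is committed.
    """
    read = read_state(S, k)
    return [
        [
            S[i][j]
            - read[i] * erase[j] * k[j]   # erase along key-side coordinates
            + write[i] * v[i] * k[j]      # write along value-side coordinates
            for j in range(len(k))
        ]
        for i in range(len(S))
    ]


# Store one association, then erase it without writing anything new.
S0 = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]
v = [2.0, 3.0]
S1 = gated_delta_step(S0, k, v, erase=[1.0, 1.0], write=[1.0, 1.0])
S2 = gated_delta_step(S1, k, v, erase=[1.0, 1.0], write=[0.0, 0.0])
```

With a single shared gate the two decisions cannot be separated like this: setting the scalar to zero disables erasing as well as writing.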

📊 Results:

We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.

Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Joint work with @YejinChoinka and @jankautz.

📄 Paper: https://t.co/Zw6yXbHjGU
💻 Code: https://t.co/s8IWwaRU18

#LinearAttention #StateSpaceModels #Mamba #LLM

651

431

193K

johnnync13 retweeted

NVIDIA Robotics

@NVIDIARobotics

16 days ago

What does it take to build generalist humanoid robots? 🦾 At #ICRA2026, @yukez will explore data-centric approaches for general-purpose robot autonomy, including how real-world, synthetic and web data can help train robotic foundation models for open-world tasks. Learn more: https://t.co/hytCxlIzB4

122
