Prithvijit @prithvijitch - Twitter Profile

Pinned Tweet

11 days ago

Cosmos 3 is out! It's our latest family of Omni World Foundation Models for Physical AI. It uses a Mixture-of-Transformers (MoT) architecture to unify a reasoner and a generator tower into a shared omnimodal world model that moves fluidly across text, images, video, audio, and actions. It is now a leading open-source model across understanding, reasoning, generation, and action benchmarks for Physical AI. Webpage: https://t.co/PSFO1sxim4

Prithvijit

@prithvijitch

9 days ago

Check out Sahil and Mengqi's work! I'm also at #CVPR2026 -- if you want to talk pre-training, evals, data, world models, or how we built Cosmos 3 (and everything that broke along the way :)), down to chat.

Sahil Khose @ CVPR 2026 ✈️ @SahilKhose

9 days ago

We are presenting WFM-Eval at two @CVPR 2026 workshops in Denver 📍

🗓️ Jun 3, Video World Models
Poster 9:50–10:40 AM, Exhibit Hall A

🗓️ Jun 4, Foundation Models Meet Embodied Agents
Poster 3:55–4:30 PM

Come say hi 👋

Work done with @AmberZhang99 @prithvijitch @judyfhoffman https://t.co/9LHrMSTpbe

Prithvijit

@prithvijitch

10 days ago

@hamidpalangi Thanks Hamid!

Prithvijit

@prithvijitch

11 days ago

Cosmos 3 is #1 among open-weights models!

Artificial Analysis

@ArtificialAnlys

11 days ago

NVIDIA's Cosmos 3 lands at #1 among open weights models in both Text to Image and Image to Video on the Artificial Analysis Leaderboards!

Cosmos 3 is a family of omnimodal world models for Physical AI from @nvidia, unifying language, image, video, audio and action in a single Mixture-of-Transformers architecture that pairs an autoregressive reasoner with a diffusion generator.

The family comes in four variants: base Nano (16B: 8B reasoner tower + 8B generator tower) and Super (64B: 32B reasoner tower + 32B generator tower) models, with the Super model also having Text2Image and Image2Video fine-tuned variants, which are the versions listed in the Artificial Analysis Arena Leaderboards.

Cosmos3-Super-Text2Image (agentic) runs through an agentic prompt-upsampling harness, and takes the #1 open weights spot in Text to Image, surpassing HiDream-O1-Image-Dev-2604, Alibaba's Qwen Image Max 2512 and Black Forest Labs' FLUX.2 [dev].

Cosmos3-Super-Image2Video takes #1 open weights in Image to Video (No Audio), ahead of Lightricks' LTX-2, and Alibaba's Wan 2.2 A14B.

Cosmos 3 generators take structured JSON prompts rather than plain text, so prompt upsampling is needed to reproduce these results. This upsampling can be handled by an external harness or by the model's own reasoner branch, so it can also run self-contained.

Cosmos 3 is fully open under the OpenMDW 1.1 license, shipping with weights, code, curated datasets and fine-tuning recipes available on @huggingface. First-party and third-party APIs are expected over the next few weeks, with pricing to follow.

See the thread below for example generations and a link to try Cosmos 3 in our arena 🧵

prithvijitch retweeted

Zekun Hao @zekun_hao

11 days ago

Look what we’re cooking! Cosmos 3 is a family of unified omnimodal world models (language, image, video, audio, action), topping multiple benchmarks! Proud to have led Cosmos3-Super-Image2Video, now the #1 open I2V model on Artificial Analysis. Hope it empowers the community!

Prithvijit

@prithvijitch

11 days ago

We're open-sourcing Cosmos 3 today along with a technical report detailing what went into building it. This project pushed us through some genuinely hard problems, and the report tries to capture the depth of that work. It has been a privilege to be able to contribute to different aspects of this project. This was a huge team effort! Technical Report: https://t.co/8rgwNhvxde

prithvijitch retweeted

Andrej Karpathy

@karpathy

about 1 month ago

This is the quote I've been citing a lot recently.

prithvijitch retweeted

Anurag Bagchi @Miccooper9

5 months ago

[1/6] Ego-centric World Models: We introduce EgoWM, a video world model that simulates EVE-1X humanoid interactions from a single ego-view image + full-body joint angle trajectories. Moreover, it effortlessly generalizes to extreme OOD domains, including paintings!

prithvijitch retweeted

Tsung-Yi Lin

@TsungYiLinCV

5 months ago

I’m thrilled to share that Cosmos Reason 2 is here, our latest open, high-accuracy reasoning vision-language model for physical AI. Read our blog to learn more 📖 https://t.co/mmlziCbcbl Download Cosmos Reason 2 👉 https://t.co/oV2KWwkOVf

prithvijitch retweeted

Haotian Ye

@haotian_yeee

6 months ago

🤔Want a principled way to RL your diffusion model? Check Data-regularized Reinforcement Learning (DDRL)! Post-train @nvidia #Cosmos World Foundation models with a million GPU hours! 🤯 Novel formulation ➡️ Theoretically integrates SFT into RL ➡️ Robust to Reward Hacking 🛑 Details: https://t.co/1A9q8ho2xb #DDRL #Diffusion #RL #NVIDIA #Cosmos

prithvijitch retweeted

Kaiwen Zheng @zkwthu

8 months ago

🚀 Try out rCM, the most advanced diffusion distillation!
✅ First to scale up sCM/MeanFlow to 10B+ video models
✅ Open-sourced FlashAttention-2 JVP kernel & FSDP/CP support
✅ High quality & diversity videos in 2~4 steps
Paper: https://t.co/xZZK25oIrJ
Code: https://t.co/aPAo1MO0JQ https://t.co/wvJOGOOKXY

prithvijitch retweeted

Anurag Bagchi @Miccooper9

8 months ago

[ICCV 25] Refer Everything Model (REM) (1/6) We leverage Text-to-Video Generation models to zero-shot segment any concept in a video using text. REM generalises to dynamic concepts like smoke, light-beam and more without ever having seen segmentation masks for these entities.

prithvijitch retweeted

Ayush Shrivastava @ayshrv

8 months ago

(1/n) Can pretrained video diffusion models be prompted to track pixels — without any retraining? We introduce Point-Prompting, a zero-shot point tracking method that simply prompts video models to visually mark and propagate points across time. 🌐 https://t.co/ZhNTp7e8zt

prithvijitch retweeted

Jiasen Lu @jiasenlu

9 months ago

Vision tokenizers are stuck in 2020🤔while language models revolutionized AI🚀 Language: One tokenizer for everything Vision: Fragmented across modalities & tasks Introducing AToken: The first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND understanding in a single transformer framework. Paper: https://t.co/wiN4WJDV6I | Code & models coming soon 🧵

prithvijitch retweeted

Yogesh

@YogeshBalaji95

12 months ago

Catch our #CVPR2025 poster today! 🖼️ “A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation” 📍 Exhibit Hall D, Poster #230 🕓 4:00–6:00 PM We explore how LLMs perform as text encoders for image generation—with some interesting findings! 🔗 Webpage: https://t.co/lUkWuHdHuk 📄 Paper: https://t.co/mOOCQfXZ2P Amazing work by @Andrewzzzwang and @Songwei_Ge during their internship at NVIDIA

Prithvijit

@prithvijitch

12 months ago

Haoqi Fan talking about “BAGEL: Unified Multimodal Model as World Foundational Model” in Room 108 right now!

Prithvijit

@prithvijitch

about 1 year ago

The WorldModelBench workshop is happening tomorrow (June 12th) at #CVPR2025! We have an exciting series of talks, do attend! Place: Room 108 Time: Morning Session #NVIDIAResearch

Prithvijit

@prithvijitch

over 1 year ago

Join us at the WorldModelBench workshop at #CVPR2025 where we'll tackle systematic evaluation of World Models! Focus: benchmarks, metrics, downstream tasks, and safety. Submit papers now: https://t.co/1Vhn814Ht6

Prithvijit

@prithvijitch

12 months ago

Aditya Grover is talking about “Diffusion Language Models for Multimodal Understanding” in Room 108 right now!
