Yichen Sheng

29 days ago

Heading to #CVPR2026! I’m excited to present our work: I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
📍 Denver
🗓 June 6, 4:45–6:45 PM MDT

For years, learning-based interactive 3D scene generation has been shaped and constrained by 3D-FRONT. Methods learn layout distributions from this dataset, making their spatial priors tightly coupled to its limited diversity and spatial relation statistics.

In I-Scene, we answer a key question:
❓ Where do spatial priors for interactive 3D scene generation come from when data is limited?

Our finding is surprisingly simple: strong 3D instance models already encode rich spatial priors. By reprogramming an instance-level model with scene-context attention and view-centric space, I-Scene can generalize to interactive 3D scenes without relying on heavily annotated scene datasets.

🌟 Key takeaways:
1️⃣ View-centric space matters for spatial relationships
2️⃣ Randomly composed objects can provide surprisingly strong supervision
3️⃣ Strong instance models can serve as a foundation for real-to-sim 3D scene generation

Come by and chat if you’re interested in #GenAI, #SpatialIntelligence, #RealToSim, or #EmbodiedAI!

0

57

7

36

4K

Coding_Black retweeted

Bryan Catanzaro

@ctnzr

29 days ago

Nemotron 3 Ultra: Frontier smart. 5X faster. 30% cheaper. 💚💚💚

45

895

94

119

276K

about 1 month ago

Cool world action model!

Ruili Feng

@feng_ruili_frl

about 1 month ago

“Carve nature at its joints.” — after Plato We built WALL-WM, an event-centric World Action Model. Fixed chunks cut by clock. Semantic events cut by embodied dynamics. Instead of predicting fixed-length action chunks, WALL-WM learns through action-grounded events: reach, grasp, lift, move, place. The surprising part: this was not just a cleaner formulation. It gave much stronger real-world generalization across language, scenes, and tasks. Maybe the next token for robots should be an event.

1

15

2

13

4K

0

1

0

152

about 1 month ago

Try it out!

about 1 month ago

🚀 Want to see how we do real-to-sim from a single input image? We’re releasing the code for I-Scene #CVPR2026!

✨Highlights:
- Stronger scene generalization trained on randomly composed objects
- Scalable data generation for downstream tasks
- Supports both 3D Gaussian Splatting and mesh outputs

Try the online Hugging Face demo and play with I-Scene yourself!

GitHub: https://t.co/nA8HICjDwz
Demo: https://t.co/vPCUL10VRW
Project: https://t.co/wo9hHFXvEb

2

123

19

113

14K

0

2

0

1

231

about 1 month ago

Glad to be selected as an Outstanding Reviewer. Appreciate the AC’s recognition. For reviewers who worked hard but were not selected: please don’t be discouraged. The process can be random. I’ve tried to keep my review quality high for years, and only got selected this year. Every responsible reviewer is a hero, selected or not. Your effort is not wasted — you are helping the community and making the right choice for science.

#CVPR2026 @CVPR

about 1 month ago

We are grateful to all of the 17,491 reviewers who helped make #CVPR2026 possible. We are especially pleased to recognize the following Outstanding Reviewers, whose high-quality reviews (as judged by their Area Chairs) placed them among the top 5% of reviewers.

https://t.co/YjQppx6a8K

5

226

43

30

97K

1

4

0

207

Coding_Black retweeted

about 2 months ago

Very impressive! We observed a similar finding in I-Scene — directly mapping 2D observations into canonical 3D space makes pixel-to-3D correspondence hard to learn. Coordinate reparameterization really matters! And that is even more important for scene generation. https://t.co/wo9hHFXvEb

1

44

2

36

7K

Coding_Black retweeted

about 2 months ago

Thank you Alexandre for sharing our ICLR2026 work!

1

16

4

2K

Coding_Black retweeted

Jiawei Yang

@JiaweiYang118

2 months ago

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.

56

958

157

623

231K

2 months ago

@BoyuanChen0 @OpenAI Cool model! It seems the model spends a lot of compute on high-resolution training.

0

2

0

824

2 months ago

PixelDiT has been accepted to #CVPR2026 as an oral paper. Here is our code/checkpoint release: https://t.co/SydgwR8p5N Paper: https://t.co/HtjBJR08cS Project: https://t.co/Yd4jGJAZXm Kudos to Yongsheng and @WeiX1762273!

Wei Xiong

@WeiX1762273

2 months ago

🚀 Excited to share that our paper, PixelDiT: Pixel Diffusion Transformers for Image Generation, has been accepted to #CVPR2026 as an Oral Presentation. This is a collaboration between NVIDIA and University of Rochester. Today, we release the code and checkpoints of both our class-conditional model and our 1.3B text-to-image (1024x1024 res) model for research purposes. Feel free to use them and leave comments.

PixelDiT is a single-stage, end-to-end diffusion model that eliminates the need for VAEs or any vision encoders and learns the diffusion process directly in pixel space. It works well on both class-conditioned generation and text-to-image generation. Our approach completely avoids the information loss introduced by the encoder's compression, which can benefit a wide range of applications, including low-level vision, image editing, etc. Pixel-space diffusion models have advanced rapidly in recent months, and our model remains state-of-the-art among models of comparable scale. We believe this represents a promising new paradigm for training visual generative models.

Notably, in the paper, we include a section "Things We Tried But Did Not Work" (see appendix section G), to share empirical observations and negative results that may be informative for future research on pixel-space diffusion models. We hope this can benefit the community.

Code&Checkpoint: https://t.co/LuKOnwLvid
Latest Paper: https://t.co/CUiDZ7SlPW
Project Page: https://t.co/yJfUHTpkej

0

11

0

1

1K

0

10

0

1

812

Coding_Black retweeted

Johan Edstedt @Parskatt

3 months ago

Introducing LoMa, the next generation of feature matcher!

8

293

36

223

37K

Coding_Black retweeted

Phota Labs

@PhotaLabs

3 months ago

📒 Phota 101: Profile Setup covers best practices to help you get the most out of Phota Studio. Profiles are at the center of Phota. Built from your personal album, your identity models learn the details of your appearance so edits and generations across different contexts preserve your identity. Your photos and models are owned by you and are not used for any other model training.

2

10

5

6

3K

Coding_Black retweeted

Oliver Mackenzie

@oliemack

3 months ago

Here's some of the DLSS 5 material we saw in the demos but didn't get a chance to film. Here I think you can see the strengths of DLSS 5 - reflections become much more attractive. Starfield doesn't have great lighting to begin with, so the differences can be profound.

184

953

55

150

172K

Coding_Black retweeted

Ryan Shrout

@ryanshrout

4 months ago

https://t.co/cTk97cbcHA

305

1K

154

303

813K

4 months ago

@myq_1997 Yes, they are still the same.

1

0

52

4 months ago

@CVPR What does "accept and suggest to Findings" mean? Does it mean the paper is accepted only to the Findings workshop, or accepted to the main conference and also suggested for the Findings workshop?

3

6

0

3K