Nicholas Pfaff @NicholasEPfaff - Twitter Profile

Pinned Tweet

4 months ago

Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: https://t.co/UZklSkJe9V 👇🧵(1/8)

18

564

80

305

74K

Nicholas Pfaff

@NicholasEPfaff

about 19 hours ago

It's very hard to obtain any kind of proof for that, so the evidence is mainly qualitative. For visual quality, we run a large-scale user study. For robotic sim, we teleoperate robots, run pretrained policies zero-shot (they were trained mainly on real data, not in our scenes, so they wouldn't work if the scenes weren't close enough to their real-world training data), and run an evaluation pipeline that successfully differentiates policies of varying quality.

0

2

0

72

Nicholas Pfaff

@NicholasEPfaff

about 20 hours ago

SceneSmith is now an ICML 2026 Spotlight (top 2.2%) and will be presented in Korea this summer! Meet SceneSmith: an agentic system that generates simulation-ready indoor environments from a single text prompt. New in the camera-ready: zero-shot rollouts of an externally trained robot policy inside generated SceneSmith scenes. 👇 (1/3)

4

82

9

42

6K

Nicholas Pfaff

@NicholasEPfaff

about 20 hours ago

And one more of our teleop demos (head and external camera view) in the generated scenes. The project site has more zero-shot videos, more teleop videos, and videos of a mobile iiwa policy being evaluated in our scenes.

0

153

Nicholas Pfaff

@NicholasEPfaff

about 20 hours ago

The camera-ready adds a new page on qualitative robot simulation demonstrations, plus additional baseline-control experiments, and a detailed limitations/failure analysis. Project: https://t.co/UZklSkJe9V Paper (updated): https://t.co/0Mlq1Z0NLp Code: https://t.co/UlrTxh2yIT Joint work with @cohnthomas43, @ZakharovSergeyN, @RickCory21, @RussTedrake (3/3)

1

5

1

2

426

Nicholas Pfaff

@NicholasEPfaff

19 days ago

Integrating them could be really great, especially as limited support for articulated objects is a big limitation of SceneSmith at the moment. I also think that future scene generation systems could hugely benefit from some of your efficiency ideas to make it easier to scale them without huge budgets.

0

33

NicholasEPfaff retweeted

Sergey Zakharov

@ZakharovSergeyN

about 1 month ago

Releasing RecGen: a collaboration between @ToyotaResearch, @toyota_europe, and @UvA_Amsterdam tackling a core 3D vision challenge: reconstructing complete multi-object scenes (parts, poses, textures, even occluded geometry) from just 1 to a few RGB-D views. Trained purely on synthetic data, RecGen achieves SOTA on real-world robotics and 6D pose benchmarks, handling occlusions, symmetry, and complex interactions. A step toward scalable, high-fidelity digital twins for robotics, and better evaluation and training of generalist policies. https://t.co/x4EEcRy77V

2

220

35

170

27K

NicholasEPfaff retweeted

Katherine Liu @robo_kat

about 1 month ago

Also, if you’re wondering how we generated all these cool videos from the Drake sim, check out @NicholasEPfaff’s repo https://t.co/hK9aq6rrj7 as a starting point 👀

0

29

6

22

3K

NicholasEPfaff retweeted

Jean Mercat @MercatJean

about 1 month ago

Releasing VLA Foundry: an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. End-to-end control from language pretraining to action-expert fine-tuning — no more stitching together incompatible repos.

10

489

76

364

74K

NicholasEPfaff retweeted

jenny huang @JennyHuang99

3 months ago

🧵1/ 🤔New paper: Do LLMs Benefit from Their Own Words? In multi-turn chats, models are typically given their own past responses as context. But do their own words always help… or can they sometimes be a distraction? https://t.co/ZvB84KFgPp

6

170

34

124

18K

Nicholas Pfaff

@NicholasEPfaff

4 months ago

I haven't ever timed this 😅 However, we implemented a bunch of performance improvements targeting throughput over latency. Hence, generating a single scene wouldn't be much faster than generating ~25 scenes or so in parallel. The biggest bottleneck is API response times. Hence, we have an option to opt into OpenAI's priority tier, which speeds this up by 50% but is twice as expensive. Switching to Gemini Flash (or other speed-optimized models) should also make a big difference here.

0

1

0

46
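The throughput-over-latency point in the reply above can be illustrated with a small sketch. This is not SceneSmith's code: `generate_scene`, the 0.05 s simulated API delay, and the concurrency limit of 25 are all assumptions made for illustration. When each scene spends most of its time waiting on API responses, a batch of concurrent scenes takes roughly as long as a single one.

```python
import asyncio
import time
from typing import List, Tuple

API_DELAY_S = 0.05  # assumed stand-in for one slow API round trip


async def generate_scene(prompt: str, limit: asyncio.Semaphore) -> str:
    """Hypothetical scene generator that only waits on a simulated API call."""
    async with limit:
        await asyncio.sleep(API_DELAY_S)
        return f"scene for: {prompt}"


async def generate_batch(prompts: List[str], max_parallel: int) -> List[str]:
    """Generate all scenes concurrently, at most max_parallel at a time."""
    limit = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(*(generate_scene(p, limit) for p in prompts))


def timed_batch(n_scenes: int, max_parallel: int = 25) -> Tuple[List[str], float]:
    """Run a batch and return the scenes plus the wall-clock time in seconds."""
    prompts = [f"kitchen {i}" for i in range(n_scenes)]
    start = time.perf_counter()
    scenes = asyncio.run(generate_batch(prompts, max_parallel))
    return scenes, time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 25):
        scenes, elapsed = timed_batch(n)
        print(f"{n:>2} scenes in {elapsed:.2f} s")
```

In this toy setup the two timings come out close to each other, which is the sense in which a single scene would not be much faster than a parallel batch.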

Nicholas Pfaff

@NicholasEPfaff

4 months ago

@vatsalbajaj We have not tried RL-based training yet. We do have teleop demos on the website that could be used for supervised learning. RL would be exciting to try!

1

0

76

Nicholas Pfaff

@NicholasEPfaff

4 months ago

@YufeiWang25 Very cool! Thanks for sharing. Using image priors is a promising way to improve spatial reasoning.

0

154

Nicholas Pfaff

@NicholasEPfaff

4 months ago

Agreed here. VLMs seem to struggle with spatial imagination. Setting a table with place settings that face in different directions (and not just toward the current image render) is a revealing case of this. Image generative models are much better at this. Maybe the next version will use agentic video models?

0

1

0

95

NicholasEPfaff retweeted

Shivaram Kumar @shirakuex

4 months ago

@allen_ai This is awesome work! Curious—any plans to integrate SceneSmith-like agentic scene generation into MolmoSpaces? It feels like a natural combo: MolmoSpaces benchmark + SceneSmith prompt-to-sim scenes = infinite evaluation distribution. https://t.co/wCgZcS8dsY

1

4

1

3

1K

NicholasEPfaff retweeted

Ai2 @allen_ai

4 months ago

Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.

10

718

103

362

97K

NicholasEPfaff retweeted

Ilir Aliu

@IlirAliu_

4 months ago

Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments.
📍 Paper AND Code:
Instead of hand-building scenes in simulation, you write one prompt. SceneSmith builds the world for you.
> Room layout.
> Furniture.
> Wall and ceiling objects.
> Small movable items.
Each stage is handled by a team of VLM agents: one proposes, one critiques, one coordinates.
The result is not just pretty scenes, but physics-ready environments. Every object:
• Metric scale
• Collision geometry
• Estimated mass, inertia, friction
• <2% object collisions
• 96% stable under gravity
And it exports directly to MJX, USD, SDFormat.
If you train or evaluate robot policies, environment creation is usually the bottleneck. SceneSmith turns it into an on-demand layer. You can generate dozens of diverse scenes per task and automatically evaluate policies across them, with 99.7% agreement to human labels. That means:
• More robust policies
• Faster benchmarking
• No hand-written success predicates
205 participants preferred SceneSmith scenes 92% of the time for realism and 91% for prompt faithfulness.
Environment generation is no longer the slow part of robot research. If you work on sim2real, policy scaling, or automated evaluation, this is worth bookmarking and sharing with your team.
📍GitHub: https://t.co/ZAKnAPbcEq
Paper: https://t.co/xAD1zwyydR
Code: https://t.co/ya5DVlkumV
—-
Weekly robotics and AI insights. Subscribe free: https://t.co/9Nm01QUcw3

5

44

1

21

4K
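The "one proposes, one critiques, one coordinates" pattern described in the post above can be sketched as a simple loop over the four listed stages. This is a toy illustration only, not SceneSmith's implementation: `propose`, `critique`, `coordinate`, the `Scene` class, the fixed feedback string, and the round limit are all hypothetical, and the real agents would call a VLM rather than format strings.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Stage names taken from the post; everything else below is assumed.
STAGES = ["room layout", "furniture", "wall and ceiling objects", "small movable items"]


@dataclass
class Scene:
    """Accumulates the final proposal kept for each stage."""
    prompt: str
    accepted: Dict[str, str] = field(default_factory=dict)


def propose(stage: str, prompt: str, feedback: Optional[str]) -> str:
    """Stand-in for a proposing agent; a real one would call a VLM."""
    suffix = f" (revised: {feedback})" if feedback else ""
    return f"{stage} for '{prompt}'{suffix}"


def critique(proposal: str) -> Optional[str]:
    """Stand-in for a critic agent; returns feedback, or None to accept."""
    return None if "revised" in proposal else "tighten spacing"


def coordinate(prompt: str, max_rounds: int = 3) -> Scene:
    """Stand-in for a coordinator: iterate each stage until the critic accepts."""
    scene = Scene(prompt)
    for stage in STAGES:
        feedback = None
        proposal = ""
        for _ in range(max_rounds):
            proposal = propose(stage, prompt, feedback)
            feedback = critique(proposal)
            if feedback is None:
                break
        scene.accepted[stage] = proposal  # keep the last proposal either way
    return scene


if __name__ == "__main__":
    for stage, result in coordinate("a cozy kitchen").accepted.items():
        print(f"{stage}: {result}")
```

The round limit keeps the loop from running forever if the critic never accepts; how the actual system bounds or orders its agent interactions is not stated in the post.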

Nicholas Pfaff

@NicholasEPfaff

4 months ago

We don't have an explicit argument for that. However, the input just gets sent to a VLM agent, which can natively take both text and image inputs. Hence, supporting this seems like a minor code change. You could already use a VLM to describe a set of images in great detail in text and use that. We have been doing that, and it works quite well.

0

18

Nicholas Pfaff

@NicholasEPfaff

4 months ago

@sippeyxp @RussTedrake All objects already have estimated friction, and we do support articulated objects.

0

1

0

60
