Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt.
VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties.
We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation.
Website: https://t.co/UZklSkJe9V
👇🧵(1/8)
Its very hard to obtain some type of proof for that, so its mainly qualitative. Visually, we run a large-scale user study. For robotic sim, we teleoperate robots, run pretrained policies zero-shot (they were mainly trained on real data and not in our scenes and thus wouldn't work if it wouldn't be close enough to their real world training data), and run an evaluation pipeline that successfully differentiates policies of varying qualities.
SceneSmith is now an ICML 2026 Spotlight (top 2.2%) and will be presented in Korea this summer!
Meet SceneSmith: an agentic system that generates simulation-ready indoor environments from a single text prompt.
New in the camera-ready: zero-shot rollouts of an externally trained robot policy inside generated SceneSmith scenes.
👇 (1/3)
And one more of our teleop demos (head and external camera view) in the generated scenes. The project site has more zero-shot videos, more teleop videos, and videos of a mobile iiwa policy being evaluated in our scenes.
The camera-ready adds a new page on qualitative robot simulation demonstrations, plus additional baseline-control experiments, and a detailed limitations/failure analysis.
Project: https://t.co/UZklSkJe9V
Paper (updated): https://t.co/0Mlq1Z0NLp
Code: https://t.co/UlrTxh2yIT
Joint work with @cohnthomas43, @ZakharovSergeyN, @RickCory21, @RussTedrake
(3/3)
Integration them could be really great, especially as limited articulated is a big limitation of SceneSmith at the moment. I also think that future scene generation systems could hugely benefit from some of your efficiency ideas to make it easier to scale them without huge budgets.
Releasing RecGen: a collaboration between @ToyotaResearch, @toyota_europe, and @UvA_Amsterdam tackling a core 3D vision challenge: reconstructing complete multi-object scenes (parts, poses, textures, even occluded geometry) from just 1 to a few RGB-D views.
Trained purely on synthetic data, RecGen achieves SOTA on real-world robotics and 6D pose benchmarks, handling occlusions, symmetry, and complex interactions.
A step toward scalable, high-fidelity digital twins for robotics, and better evaluation and training of generalist policies.
https://t.co/x4EEcRy77V
Also, if you’re wondering how we generated all these cool videos from the Drake sim, check out @NicholasEPfaff’s repo https://t.co/hK9aq6rrj7 as a starting point 👀
Releasing VLA Foundry: an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. End-to-end control from language pretraining to action-expert fine-tuning — no more stitching together incompatible repos.
🧵1/
🤔New paper: Do LLMs Benefit from Their Own Words?
In multi-turn chats, models are typically given their own past responses as context.
But do their own words always help… or can they sometimes be a distraction?
I haven't ever timed this 😅 However, we implemented a bunch of performance improvements targeting throughput over latency. Hence, it wouldn't be much faster than when generating ~25 scenes or so in parallel. The biggest bottleneck is API response times. Hence, we have an option to opt into OpenAI's priority tier that speeds this up by 50% but is twice as expensive. Switching to Gemini Flash (or other speed-optimized models should also make a big difference here).
Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt.
VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties.
We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation.
Website: https://t.co/UZklSkJe9V
👇🧵(1/8)
@vatsalbajaj We have not tried RL-based training yet. We do have teleop demos on the website that could be used for supervised learning. RL would be exciting to try!
Agreed here. VLMs seem to struggle with spatial imagination. Setting a table with place settings that face in different directions (and not just toward the current image render) is a revealing case of this. Image generative models are much better at this. Maybe the next version will use agentic video models?
@allen_ai This is awesome work! Curious—any plans to integrate SceneSmith-like agentic scene generation into MolmoSpaces?
It feels like a natural combo: MolmoSpaces benchmark + SceneSmith prompt-to-sim scenes = infinite evaluation distribution.
https://t.co/wCgZcS8dsY
Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖
230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.
Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments.
📍 Paper AND Code:
Instead of hand-building scenes in simulation, you write one prompt.
SceneSmith builds the world for you.
> Room layout.
> Furniture.
> Wall and ceiling objects.
> Small movable items.
Each stage is handled by a team of VLM agents: one proposes, one critiques, one coordinates. The result is not just pretty scenes, but physics-ready environments.
Every object:
•Metric scale
•Collision geometry
•Estimated mass, inertia, friction
•<2% object collisions
•96% stable under gravity
And it exports directly to MJX, USD, SDFormat.
If you train or evaluate robot policies, environment creation is usually the bottleneck. SceneSmith turns it into an on-demand layer. You can generate dozens of diverse scenes per task and automatically evaluate policies across them, with 99.7% agreement to human labels.
That means:
•More robust policies
•Faster benchmarking
•No hand-written success predicates
205 participants preferred SceneSmith scenes 92% of the time for realism and 91% for prompt faithfulness.
Environment generation is no longer the slow part of robot research.
If you work on sim2real, policy scaling, or automated evaluation, this is worth bookmarking and sharing with your team.
📍GitHub: https://t.co/ZAKnAPbcEq
Paper: https://t.co/xAD1zwyydR
Code: https://t.co/ya5DVlkumV
—-
Weekly robotics and AI insights.
Subscribe free: https://t.co/9Nm01QUcw3
We don't have an explicit argument for that. However, the input just gets sent to a VLM agent, which can natively take both text and image inputs. Hence, supporting this seems like a minor code change. You could already use a VLM to describe a set of images in great detail in text and use that. We have been doing that, and it works quite well.