Planning with the views:
Can VLMs predict how each camera move changes the view, and plan many such moves ahead?
We introduce ViewSuite with 6-DoF camera control and ~165K task instances, testing:
Path-to-View
View-to-Path
Interactive View Planning
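To make the first two task directions concrete, here is a minimal toy sketch, not the ViewSuite implementation. It assumes a simplified pose (x, y, z, yaw in quarter turns) and a hypothetical discrete action set; real 6-DoF control would also include pitch, roll, and continuous step sizes. `track` applies a known path to reach a final pose (Path-to-View), and `plan` searches for a path to a target pose (View-to-Path).

```python
from collections import deque

# Hypothetical toy setup: a pose is (x, y, z, yaw), with yaw counted in quarter turns.
# Yaw 0 faces +y; each "turn_right" rotates the heading clockwise by 90 degrees.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
ACTIONS = ["forward", "backward", "left", "right",
           "up", "down", "turn_left", "turn_right"]


def step(pose, action):
    """Apply one camera action, expressed in the camera's own frame."""
    x, y, z, yaw = pose
    fx, fy = HEADINGS[yaw]              # forward direction
    rx, ry = HEADINGS[(yaw + 1) % 4]    # right-hand direction
    if action == "forward":
        return (x + fx, y + fy, z, yaw)
    if action == "backward":
        return (x - fx, y - fy, z, yaw)
    if action == "right":
        return (x + rx, y + ry, z, yaw)
    if action == "left":
        return (x - rx, y - ry, z, yaw)
    if action == "up":
        return (x, y, z + 1, yaw)
    if action == "down":
        return (x, y, z - 1, yaw)
    if action == "turn_right":
        return (x, y, z, (yaw + 1) % 4)
    if action == "turn_left":
        return (x, y, z, (yaw - 1) % 4)
    raise ValueError(f"unknown action: {action}")


def track(pose, path):
    """Path-to-View direction: follow a given action sequence to its final pose."""
    for action in path:
        pose = step(pose, action)
    return pose


def plan(start, target, max_depth=8):
    """View-to-Path direction: breadth-first search for a shortest action sequence."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pose, path = queue.popleft()
        if pose == target:
            return path
        if len(path) >= max_depth:
            continue
        for action in ACTIONS:
            nxt = step(pose, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # no plan found within max_depth


start = (0, 0, 0, 0)
target = track(start, ["forward", "turn_right", "forward", "up"])
found = plan(start, target)
```

In this toy world, tracking is a simple fold over the actions, while planning needs a search over many candidate sequences, which gives some intuition for why the two abilities can come apart.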
A sharp Planning Gap emerges:
+ can roughly "track" how a camera action changes the view
- cannot "compose" a plan towards a target view at all
We then try to teach VLMs with Reinforcement Learning.
- RL alone cannot teach VLMs this planning ability: only a 2.5% success rate with Qwen2.5-VL-7B.
+ With View Graph Distillation (our RL-Graph-SFT framework), 2.5% → 47.8%
Below, we answer these questions:
Q1. What are the failure modes?
Q2. How can we make RL work?
Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view related tasks?
Led by @James_KKW, great to work with @LINJIEFUN @zhengyuan_yang @shiqi_chen17 @wzenus @drfeifei @jiajunwu_cs, Leonidas Guibas, and Lijuan Wang.
A joint effort with @StanfordAILab @StanfordSVL @MSFTResearch.
Your thoughtful reflection is so inspiring and encouraging, @smallfly! As everyone talks about AI and automation, human creativity, storytelling, and productivity are even more important and essential to our society. @theworldlabs is founded on the premise of empowering human ingenuity and productivity. We are very grateful to be able to work with people like you! 🙏🌐
@FastCompany just published a great piece on @theworldlabs, @drfeifei, Marble, and the idea that spatial intelligence / world models may be one of the next big shifts in AI.
I was happy to be quoted in the article, but I also wanted to share more context about my own experience with World Labs and Marble, and why this direction is especially interesting to me.
https://t.co/mdWBmSuNBe
My starting point: volumetric capture
—
For the past few years I’ve been exploring and using volumetric capture and reconstruction (photogrammetry, NeRFs, 3D Gaussian Splats), mostly capturing locations around Montreal: alleys, museums, urban interiors.
I love every step of it: the capture itself, the pipeline, and what can be done with the output. Turning real spaces into real-time explorable systems.
I do this personally, sharing explorations here, and professionally as chief technologist and co-founder of Dpt.
Physical reality + generative manipulation
—
In my work I’m especially drawn to mixing physical reality with generative and digital manipulation: using physical interfaces (light, clay, ink, ...) to drive generative AI pipelines, building mixed reality prototypes that reshape your surroundings, or starting from real captured spaces and transforming them using tools like Marble.
Like many people, I saw the World Labs announcement on Twitter in September 2024, and Marble when it surfaced in early December. But by then, I already had a sense something was coming.
The first conversation
—
As someone deep into volumetric capture and radiance fields, I obviously knew about @BenMildenhall and his pioneering work on NeRF. To my surprise, Ben reached out to me in late June 2024. He’d been following some of my experiments and wanted to chat about my process and workflows and how I was using this “stuff” creatively.
At that point he didn’t share what he was building, but we had a genuinely great conversation about radiance fields, AI, and my work. He was curious about the creative perspective, not just the technical one.
When the World Labs announcement dropped a few months later, it all made sense. I understood what Ben had been working on, and why the creative angle mattered to them. Then in August 2025, he invited me to try the Marble beta, and I’ve been experimenting with it since.
Experimenting with Marble
—
The first thing I used Marble for was materializing scene and world concepts during ideation at the studio, and seeing if and how it could fit into our production pipeline. In parallel, I dove into a series of experiments focused on world manipulation: starting from real captured spaces and transforming them using Marble.
I’d already been exploring that idea using img2img diffusion with ControlNet on NeRF renders, real-time video streams, and even mixed reality using headset camera feeds. But Marble brings something different. It generates persistent, spatially cohesive 3D worlds that can be rendered in real time across a wide range of devices.
That’s a real shift.
Experiment 01: Parallel Realities
—
The first experiment, Parallel Realities, starts from a volumetric capture of a real location, reconstructed as 3D Gaussian Splats. Using Marble, I generate an alternate version of that same space, something informed by the original architecture: abandoned, nature-reclaimed, alternate era.
Then, using Spark (World Labs’ 3D Gaussian Splatting renderer for THREE.js) I make both realities coexist in the same spatial coordinate system. From there, I use a portal UX mechanic to let the user step between the real reconstruction and the Marble-generated version.
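As a rough illustration of the portal mechanic, here is a minimal sketch that is independent of Spark and THREE.js and uses none of their APIs. It assumes both scenes are already aligned in one world frame, models the portal as a hypothetical circular opening (center, normal, radius), and flips the active world when the viewer's position passes through that opening between two frames. All names are made up for this example.

```python
def signed_distance(point, plane_point, plane_normal):
    """Signed distance from point to the portal plane (normal assumed unit length)."""
    return sum((p - q) * n for p, q, n in zip(point, plane_point, plane_normal))


def crossed_portal(prev_pos, cur_pos, center, normal, radius):
    """True if the segment prev_pos -> cur_pos passes through the circular portal."""
    d0 = signed_distance(prev_pos, center, normal)
    d1 = signed_distance(cur_pos, center, normal)
    if d0 * d1 >= 0:
        return False  # both positions on the same side (or exactly on the plane)
    t = d0 / (d0 - d1)  # fraction along the segment where it meets the plane
    hit = tuple(a + t * (b - a) for a, b in zip(prev_pos, cur_pos))
    dist_sq = sum((h - c) ** 2 for h, c in zip(hit, center))
    return dist_sq <= radius ** 2


def update_active_world(active, prev_pos, cur_pos, portal):
    """Swap between the captured and generated scene when the viewer steps through."""
    if crossed_portal(prev_pos, cur_pos, *portal):
        return "generated" if active == "captured" else "captured"
    return active


# Hypothetical portal: centered at the origin, facing +z, 1 unit in radius.
portal = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0)
world = update_active_world("captured", (0.0, 0.0, 1.0), (0.0, 0.0, -1.0), portal)
```

Because both scenes share a coordinate system, the viewer's position never needs to be remapped on a switch; only the choice of which scene to render changes.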
Experiment 02: Hidden Depth
—
The second experiment, Hidden Depth, does not transform a space as much as expand it.
A captured location has a visual boundary (a mural, a doorway, a dark corridor) and Marble generates what exists beyond it. For example: a Montreal alley has a painted mural; step through it and you’re inside a world informed by what is actually depicted there.
World Labs showcased part of this work here:
https://t.co/0RQTDWsgs2
And in their Spark 2.0 post:
https://t.co/X34yzkLBOm
The project page is here:
https://t.co/T6Qxuuq9RJ
Why this matters to me
—
Being able to start from a real 3D Gaussian Splat scene and manipulate it with Marble opens up a lot of ideas. The 3DGS pipeline is becoming an increasingly compelling foundation for exploration, experimentation, and storytelling.
What matters most to me right now is more control. The more I can steer the generated scene or world, the more useful the tool becomes. I want more features like the already existing multiple input images and Chisel, the blockout-based approach.
I would like better local control, the ability to expand a generated world more and more while preserving coherence, and the ability to directly import 3D Gaussian Splat scenes to be used as a starting point. I want more ways to shape the result, not just a “prompt and hope” approach.
—
It is exciting to see this field moving from research and demos toward actual creative workflows.
HAI Founding Director @drfeifei is featured on @FastCompany's cover, explaining "world models" – AI that understands physical space and real-world dynamics. Rooted in human-centered philosophy, she explains what makes it different and what's at stake: https://t.co/pHlx1M5nQt
The future of AI should be grounded in human agency, creativity, and understanding.
@FastCompany explores the rise of world models and features insights from World Labs cofounder @drfeifei:
"Her vision for World Labs—and its human-centered future—is both consistent and persistent. It's like a simulation in her world model. Once in place, it stays put."
Read the full article here ↓
3D is an exciting area where we are still figuring out the right tasks, problem formulations, architectures, and the best ways to scale.
We're sharing some of our ideas here in our first-ever papers from @theworldlabs, led by an awesome set of interns.
@chlassner It's been awesome working with you, Christoph! Best wishes for your recovery ❤️‍🩹 And I look forward to working with you in the next chapter! 😍🌐
Today we are sharing three new research papers, each exploring a new way to generate 3D content by leveraging large-scale generative models and 2D priors.
These projects were led by our incredible interns @HaoZhang623 @BDuisterhof @DrTunnels
[1/4]
Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.
The creativity and imagination are out of this world! So grateful that @theworldlabs got to partner with the amazing talents at @withloreco to translate their incredible ideas into interactive experiences for users to enjoy! 🤩
The CoRL 2026 keynote lineup is here!
🔹 Russ Tedrake — MIT; stealth startup @RussTedrake
🔹 Fei-Fei Li — Stanford; World Labs @drfeifei
🔹 Wolfram Burgard — UT Nuremberg @wolfram_burgard
Join us in Austin this November.
https://t.co/uiOkizDNIc
Excited to introduce StereoPolicy, led by @EvansXuHan.
📷📷🤖StereoPolicy is an effective way to add geometric cues to modern robot policy models while keeping the strengths of pretrained 2D encoders.
⁉️Why stereo for robot manipulation?
Monocular RGB often lacks the depth cues needed for precise manipulation, while RGB-D and point clouds can be noisy or brittle, especially on reflective and transparent objects in real-world deployment.
Instead of explicitly reconstructing disparity, depth, or point clouds, StereoPolicy directly fuses synchronized left/right RGB views to learn implicit stereo cues, avoiding extra reconstruction latency that can make real-time manipulation difficult.
Project Page: https://t.co/e07jsbKJg5
GPIC should be the new standard benchmark for generative modeling. Training for 1 epoch on GPIC costs the same as training for 100 epochs on ImageNet, but is a much better proxy for real-world problems. If you work in generative modeling, try GPIC for your next project!
1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation!
🚀100M VLM-captioned image-text pairs for training
📊1M image-text pairs for benchmarking
🖼️~28 trillion pixels
🤗Centrally Hosted
✅Fully permissive for research + commercial use
Dataset, benchmark and models🧵👇
Co-led with @KyleSargentAI
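A quick back-of-the-envelope check of the scale figures above, as a sketch under stated assumptions. It assumes "ImageNet" means the ImageNet-1k training set (1,281,167 images), uses samples seen as a rough proxy for training cost, and assumes the ~28 trillion pixels cover both the training and benchmark splits. None of these assumptions is confirmed by the posts.

```python
# Figures quoted in the posts above.
GPIC_TRAIN_IMAGES = 100_000_000
GPIC_BENCH_IMAGES = 1_000_000
GPIC_TOTAL_PIXELS = 28e12

# Assumption: "ImageNet" refers to the ImageNet-1k (ILSVRC 2012) training split.
IMAGENET_TRAIN_IMAGES = 1_281_167

# Samples seen in 100 ImageNet epochs vs. 1 GPIC epoch.
imagenet_samples = 100 * IMAGENET_TRAIN_IMAGES
ratio = GPIC_TRAIN_IMAGES / imagenet_samples

# Assumption: the pixel total spans both splits; average pixels per image
# and the side length of an equivalent square image.
avg_pixels = GPIC_TOTAL_PIXELS / (GPIC_TRAIN_IMAGES + GPIC_BENCH_IMAGES)
avg_side = avg_pixels ** 0.5

print(f"100 ImageNet epochs: {imagenet_samples:,} samples")
print(f"1 GPIC epoch / 100 ImageNet epochs: {ratio:.2f}")
print(f"average GPIC image: ~{avg_pixels:,.0f} pixels (~{avg_side:.0f} px square)")
```

Under these assumptions the two sample counts land in the same order of magnitude, which is consistent with the "same cost" claim; actual compute also depends on the training resolution and the model.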