Introducing CRISP, a real-to-sim pipeline that recovers human motion and simulatable scene geometry from monocular video!
CRISP builds contact-faithful 3D scenes for simulation: 8× fewer sim failures, 43% faster simulation, and improved human motion!
Interactive demos: https://t.co/locrdrxO16
Exciting collaboration w/ @JiashunWang, @jefftan969, @_Tsukasane, Jessica Hodgins, @shubhtuls, and @RamananDeva
Children learn from play. Can robots do the same?
We propose Playful Agentic Robot Learning, a paradigm that gives embodied coding agents a play stage before downstream tasks arrive, and instantiate it with RATs (Robotics Agent Teams), where robots discover reusable skills through curious play.
Co-led with @jiaxin_ge_
Reference motions are often used as trajectories to track or teachers to distill. We explore a different way of learning from them.
I am excited to share our work, Generalizing from References (GfR), to appear at RSS 2026, as a follow-up to our previous HIL work.
Using a unified multi-task RL framework, we jointly train reference-guided imitation and goal-driven RL within a single end-to-end policy.
No distillation.
No RL fine-tuning.
Just one policy, trained end-to-end, that learns from references and generalizes beyond them.
Rather than treating reference motions as trajectories to track, distill, or follow, we use them to shape behavior while allowing RL to explore and adapt beyond the references.
In the following example, without human joystick control, the robot can autonomously compose learned skills using only task goals.
https://t.co/pdMWBWgtCY
Things beyond locomotion coming soon.
Introducing ABC: open data, training, and infrastructure for robotics.
We release the largest teleop dataset to date, and extensively investigate design decisions, pretraining, and post-training techniques.
@arthurallshire @Cinnabar233 @adamrasb @redstone_hong @davidrmcall
Why aren't diffusion language models smart yet? The lack of stable post-training is a major bottleneck!
Meet DiPOD: the tripod for diffusion model post-training.
DiPOD boosts accuracy across reasoning tasks, with Sudoku jumping from 22% to 97%, through a one-line code change.
🧵 1/5
Introducing Modality Forcing, a recipe for post-training T2I models for SOTA RGB-Depth generation!
Text-to-image (T2I) models learn rich representations of the spatial world.
How do we build on this prior for high-quality depth generation?
https://t.co/uJjGHNiDBu
🧵 [1/6]
Over the past few years, motion tracking has largely taken over humanoid whole-body control. Most motion tracking methods rely on explicit phase variables or future target poses to track reference motions.
But do we actually need them?
We find that task conditions and scene observations alone can already provide enough structure for reference motion tracking. Building on this observation, we introduce HIL: Hybrid Imitation Learning.
Using a unified goal-conditioned observation space, we formulate motion tracking and adversarial imitation learning as a single end-to-end multi-task learning problem.
This allows a single policy to simultaneously:
• track reference motions with high fidelity
• compose and adapt skills through adversarial imitation learning
By sharing the same observation representation across both tasks, behaviors learned from motion tracking naturally transfer to more general goal-conditioned control.
To appear in ACM Transactions on Graphics (TOG 2026) & SIGGRAPH 2027
https://t.co/MBb9j1U6Sk
A real-world humanoid follow-up is coming soon
What if humanoids could climb ladders and work on them straight out of simulation?
Meet LadderMan: a perceptive system for zero-shot sim-to-real ladder climbing and on-ladder manipulation.
Watch the humanoid climb, stabilize, and manipulate, all in one system.
I'll be presenting E-RayZer at the VGI workshop (https://t.co/KKKzNoybKJ) as an invited poster (Wed 12:20-13:30, Room 703), and at the main conference poster session as a Highlight paper (Fri 4:00-6:00, ExHall A & F 33).
Come chat if you're interested!
Excited to share REST3D: REconstructing physically STable and visually consistent 3D scenes from a casual single image 🤳.
With REST3D, you can naturally interact with stable virtual objects through hand-based VR interactions.
Project page: https://t.co/1CVuGIjAVM
Introducing Articraft, a coding agent for articulated 3D asset creation.
Articraft writes code, executes it, receives validation feedback, and refines the result into simulation-ready 3D assets with parts, joints, and motion.
We're also releasing Articraft-10K: 10,000+ articulated objects across 250 categories, unlocking large-scale interactive scenes for robotics simulation and physical AI.
Project page: https://t.co/FWutv61yx7
Code: https://t.co/CpCYdBzMlv
Low-data post-training can teach a VLA policy a new robot skill. But it also leaves the policy too attached to the training demos.
We call this lock-in: the policy can execute the post-training task, yet fails to respond to seemingly obvious prompt changes.
DeLock preserves steerability using only the policy's own pretrained knowledge. No extra supervision needed!
#Robotics #AI #EmbodiedAI #VLA
What is missing to bring real-time motion research into AAA games and real-world robotics?
We present MotionBricks, a step toward bridging this gap with two key components:
- a single generative latent motion backbone covering 350,000+ motion skills, running at 15,000 FPS with 2 ms latency and substantially improved quality and reliability.
- a unified smart primitive interface for locomotion and object/scene interaction, with fine-grained control over generated behaviors.
Webpage: https://t.co/aJE5skUuWD
Code: https://t.co/r56D3TJ8CW
Paper: https://t.co/CtOHXnHZMv (ACM TOG / SIGGRAPH 2026)
Before AI can generate professional videos, it needs to see like a professional.
We spent a year with 100+ content creators teaching AI to describe video like a filmmaker would.
Introducing CHAI: Critique-based Human-AI Oversight for Building a Precise Video Language [CVPR'26 Highlight, Top 3%].
Try prompting a video generator for a dolly zoom, Dutch angle, point of view, or camera roll. Most fall back to the same bland defaults: a push-in, a level shot, a third-person view. Why? These techniques require a language of cinema that current models rarely speak.
We built that language:
1️⃣ Precise specification: 5-aspect structured captions co-designed with professional cinematographers covering subject, scene, motion, spatial, and camera dynamics
2️⃣ Scalable oversight: LLMs draft captions, humans critique what's wrong and how to fix it
3️⃣ Post-training recipes: Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5
4️⃣ Video generation: fine-tuned Wan follows 400-word cinematic prompts with precise control
Here's how each works 🧵
Work led by CMU and Harvard with @chancharikm, @du_yilun, and @RamananDeva.
Paper: https://t.co/wCwEtvrntM
Site: https://t.co/oAAQklGrfF
[1/7] Video diffusion has come a long way, generating more & more realistic videos.
Can we revisit sparse-view novel view synthesis through these video priors?
Meet FrameCrafter: a permutation-invariant multi-view model built on video diffusion 🧵
https://t.co/ogEN4mkE92
Very excited to share this work @davidrmcall did with the fantastic NVIDIA Finland team last year. We have a surprisingly simple but sample-efficient way to post-train a flow model with RL.
Most multi-view reconstruction models need full supervision. We show they can self-improve without any ground truth labels.
Introducing SelfEvo: Self-Improving 4D Perception via Self-Distillation. Up to +36.5% in video depth, +20.1% in camera estimation, zero annotation.