We are grateful to all of the 17,491 reviewers who helped make #CVPR2026 possible. We are especially pleased to recognize the following Outstanding Reviewers, whose high-quality reviews (as judged by their Area Chairs) placed them among the top 5% of reviewers.
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
https://t.co/AFJZ5kH7Ku
I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;)
And stay till the end, more easter eggs and predictions for your polymarket!
00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum!
01:42 The Great Parallel
03:31 Robotics, the Endgame
03:39 Why VLAs fall short
04:32 Video world models as the 2nd pretraining paradigm
06:09 World Action Models (WAM)
07:46 Strategies for robot data collection and the FSD equivalent of a physical data flywheel for robot manipulation
11:06 EgoScale and the Dexterity Scaling Law we discovered recently
14:00 Physical RL: bridging the last mile
15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico
17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think.
Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.
We are back. After one year of quiet building.
Introducing GENE-26.5, our first robotic brain that takes a major step toward human-level capability.
For years, robotics has struggled to learn from the world’s largest and most valuable data source: Humans.
Solving it means rethinking the whole stack from the ground up:
- A robotics-native foundation model.
- A 1:1 human-like robotic hand.
- A noninvasive data collection glove for motion, force, and touch.
- A simulator that turns weeks of experiments into minutes.
GENE-26.5 is trained across language, vision, proprioception, tactile, and action. We designed a set of tasks to test how far we can go with this new paradigm.
Fully autonomous, 1x speed, one model, same weights. (Enjoy with sound on)
We are approaching the endgame for robotics.
And this is just the beginning.
I gave an award talk @3DVconf that might be of interest to some people. I took a step back and shared a few personal stories from my 10-year journey, reflecting on the profound impact of people, luck (you need a lot!), grit, and the art of giving up. (1/2)
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I have done daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
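A toy illustration of the QKnorm point in the list above, not nanochat's actual code: assuming queries and keys are L2-normalized to unit length (the real norm may differ), every logit is a cosine in [-1, 1], so the softmax cannot get very peaked unless a multiplier is applied. The vectors and the `scale` values here are made up for the sketch and are not the multipliers the agent found.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, scale=1.0):
    # Logits are cosine similarities times a (hypothetical) scalar multiplier.
    q = l2_normalize(query)
    logits = [scale * sum(a * b for a, b in zip(q, l2_normalize(k))) for k in keys]
    return softmax(logits)

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

diffuse = attention_weights(query, keys, scale=1.0)   # no multiplier
sharp = attention_weights(query, keys, scale=10.0)    # illustrative multiplier

print("scale=1: ", [round(w, 3) for w in diffuse])
print("scale=10:", [round(w, 3) for w in sharp])
```

With no multiplier the best-matching key gets only about two thirds of the attention mass; with the illustrative multiplier it gets nearly all of it.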
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
https://t.co/WAz8aIztKT
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
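A minimal sketch of the basic loop described in this thread, under heavy assumptions: propose a change, evaluate a cheap metric, keep the change only if the metric improves. Everything named here (`autoresearch`, `toy_loss`, `toy_propose`, the config keys and target values) is hypothetical; a real setup would have an LLM agent proposing code edits and a training run producing the validation loss, not a random perturbation and a quadratic.

```python
import random

def autoresearch(evaluate, propose, config, budget, rng):
    """Greedy search: accept a proposed change only if the metric improves."""
    best_loss = evaluate(config)
    accepted = []
    for _ in range(budget):
        candidate = propose(config, rng)
        loss = evaluate(candidate)
        if loss < best_loss:
            config, best_loss = candidate, loss
            accepted.append(candidate)
    return config, best_loss, accepted

def toy_loss(config):
    # Stand-in for "train a small model and report validation loss".
    return (config["lr"] - 0.03) ** 2 + (config["weight_decay"] - 0.1) ** 2

def toy_propose(config, rng):
    # Stand-in for an agent's idea: nudge one hyperparameter.
    key = rng.choice(sorted(config))
    new_config = dict(config)
    new_config[key] = config[key] * rng.uniform(0.8, 1.25)
    return new_config

rng = random.Random(0)
start = {"lr": 0.1, "weight_decay": 0.01}
best, best_loss, accepted = autoresearch(toy_loss, toy_propose, start, 200, rng)

print("start loss:", toy_loss(start))
print("best loss: ", best_loss)
print("accepted changes:", len(accepted), "of 200 tried")
```

Promoting ideas to larger scales would amount to re-running `evaluate` on the accepted changes with a bigger model, which this toy version does not attempt.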
For video generation in robotic applications, looking pretty is usually not enough.
Robot manipulation requires understanding how visual observations and 3D geometry evolve over time under agent actions, with temporal coherence and geometric consistency across camera views.
We study this challenge in our work (recently accepted by @iclr_conf ), 4D Video Generation for Robot Manipulation, which enforces multi-view 3D consistency via geometric supervision to generate spatio-temporally aligned videos.
Sim-to-real learning for humanoid robots is a full-stack problem. Today, Amazon FAR is releasing a full-stack solution: Holosoma.
To accelerate research, we are open-sourcing a complete codebase covering multiple simulation backends, training, retargeting, and real-world inference.
I have met many students and young researchers lately who claim to be working on World Models or Embodied AI but do not even know the basics of 3D Vision or linked rigid body motions. When did we start to give students the illusion that they can *do* things right without *learning* anything right?
CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation
Contributions:
• Experimental Setup and Benchmarking: We develop a comprehensive experimental setup designed to evaluate city-scale, text-based 6DoF localization.
• Novel Approach for Text-Based 6DoF Localization: We propose a diffusion-based method for text-based 6DoF localization that operates effectively at the city scale.
• Pose Refinement Technique: We employ Gaussian splatting rendering for pose refinement, filtering out poorly matched poses and optimizing them by maximizing cosine similarity with text features. This guides the pose to the most relevant location for the given text description.
• State-of-the-Art Results: Our approach delivers superior performance, surpassing baseline methods in both pose estimation accuracy and distribution modeling.
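To make the pose refinement bullet concrete, here is a rough sketch of the filtering and ranking part only, assuming each candidate pose already has a feature vector extracted from its Gaussian splatting render. The function names, the threshold, and the toy features are all hypothetical, and the rendering, feature extraction, and the actual pose optimization from the paper are not shown.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_and_rank(candidates, text_feature, threshold):
    """Drop poorly matched poses, then sort the rest by similarity (best first)."""
    scored = [(cosine_similarity(feat, text_feature), pose) for pose, feat in candidates]
    kept = [(score, pose) for score, pose in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)

# Toy data: (pose label, feature of the view rendered from that pose).
candidates = [
    ("pose_A", [1.0, 0.0]),
    ("pose_B", [0.7, 0.7]),
    ("pose_C", [-1.0, 0.0]),
]
text_feature = [1.0, 0.0]

ranked = filter_and_rank(candidates, text_feature, threshold=0.5)
for score, pose in ranked:
    print(pose, round(score, 3))
```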
🔬 Researchers from INSAIT, ETH Zurich, University of Amsterdam and the Università di Pisa and Trento created the first-of-its-kind large-scale dataset for understanding 3D Gaussian splats. Links in comments!
🎉 Congratulations to all authors!
Looking for a PhD at the frontier of Human-centered Artificial Intelligence? 19 newly added open positions at Italy's National PhD on AI! https://t.co/qLKiSX2AUa