I’ve recently left the Tesla Optimus team to co-found Mondo Tech with my longtime friend and former DJI colleague, Soren. I’m incredibly appreciative of the opportunity to contribute to Elon’s vision for general-purpose humanoid robots. Optimus has pushed both technical frontiers and society’s imagination of what it might look like for humans and robots to coexist more closely. I’m still a firm believer, but I’m chasing a different path.
We’d like to test the hypothesis that smaller, more accessible robots, designed with consumer applications in mind, can be both meaningful products today and foundational platforms tomorrow. These systems will be compact, safe, user-friendly, and useful, capable of running modern machine learning algorithms on a fully self-developed embedded and mechatronic stack.
We’re building teams in both Palo Alto and Shenzhen. If you’re passionate about embedded systems, ML/RL, and you want to build mini robots that can serve as little friends in people’s lives, we’d love to hear from you.
That said, what truly cemented this decision is something more personal:
Our son is almost three. He loves stories. Every night, I tell him adventures about a boy named Pike and his robot companion Nick. Nick helps fix trains using his own parts, scouts mountains (adapting into a drone), and stays up all night when Pike is sick. These stories came from a place of love and hope that he might grow up in a world where robots aren’t just tools, but friendly companions. My son never met Optimus, though he used to watch Optimus videos with me. But when I had to leave him crying to return to the lab, he gradually lost interest. Most nights, after storytime, I’d drive back to work, tuning Optimus algorithms in the Tesla lab, while watching him sleep on the baby monitor, wondering what future I was building for him.
One night, I told him:
“Daddy’s not working on that robot anymore. I’m going to build you one, like Nick. One that can be with you every day.”
He looked at me, smiled, and said, “Okay.”
And that was all I needed.
We are back. After one year of quiet building.
Introducing GENE-26.5, our first robotic brain that takes a major step toward human-level capability.
For years, robotics has struggled to learn from the world’s largest and valuable data source: Humans.
Solving it means rethinking the whole stack from the ground up:
- A robotics-native foundation model.
- A 1:1 human-like robotic hand.
- A noninvasive data collection glove for motion, force, and touch.
- A simulator that turns weeks of experiments into minutes.
GENE-26.5 is trained across language, vision, proprioception, tactile, and action. We designed a set of tasks to test how far we can go with this new paradigm.
Fully autonomous, 1x speed, one model, same weights. (Enjoy with sound on)
We are approaching the endgame for robotics.
And this is just a beginning.
DiT4DiT is now open source! As the first humanoid-deployable Video-Action Model built on a world model, DiT4DiT continues to surprise us. In our paper last month, we showed its strong data efficiency. Now, with only slight modifications, it enables real-time whole-body autonomous pick-and-place.
Paper: https://t.co/tjQipiYpBS
Code: https://t.co/7pacr0mJk3
Website: https://t.co/YFZUTCTuUr
Additional to a team of people who can deploy world model based manipulation model on humanoids (Please read https://t.co/hWAkwMmsFp), Mondo also has a team of people who can build the best consumer robot in the world. You will get this robot at a low price soon.
@AndreTI it’s not a bug. It is because the VAM is very slow to infer so the robot receives discontinuous trajectories despite of smoothing mechanism. We need to improve inference time by compressing model
We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos.
Project: https://t.co/YFZUTCTuUr
Paper: https://t.co/tjQipiYpBS
Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.