@khnh80044 Yes, but I definitely wouldn’t want my children to succeed that way. It just seems too painful, and there are so many other paths to success.
VLA-JEPA just dropped in LeRobot 🤖
What makes this model special is that it does not just learn which action to take from a given observation; it also uses a JEPA world model to learn action-relevant dynamics.
During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos.
At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head.
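To make the train/inference split concrete, here is a rough toy sketch in plain Python. It is not the LeRobot or VLA-JEPA code: `ToyPolicy`, `latent_predictor`, `wm_weight`, and `for_inference` are hypothetical names, the "backbone" is a tiny linear map, and the V-JEPA2 conditioning is simplified (as an assumption) to an auxiliary loss that regresses a target latent. It only illustrates the idea of a world-modeling branch that shapes training and is dropped afterwards.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a weight matrix to a vector (no bias)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class ToyPolicy:
    backbone: Matrix                           # stand-in for the VLM backbone
    action_head: Matrix                        # maps features to actions
    latent_predictor: Optional[Matrix] = None  # training-only world-model branch

    def features(self, obs: Vector) -> Vector:
        return linear(self.backbone, obs)

    def act(self, obs: Vector) -> Vector:
        # Inference path: backbone + action head only.
        return linear(self.action_head, self.features(obs))

    def training_loss(self, obs: Vector, expert_action: Vector,
                      target_latent: Vector, wm_weight: float = 0.5) -> float:
        # target_latent stands in for what a frozen video encoder would
        # produce for a future frame (an assumption of this sketch).
        if self.latent_predictor is None:
            raise RuntimeError("world-model branch was dropped")
        feats = self.features(obs)
        action_loss = mse(linear(self.action_head, feats), expert_action)
        world_loss = mse(linear(self.latent_predictor, feats), target_latent)
        return action_loss + wm_weight * world_loss

    def for_inference(self) -> "ToyPolicy":
        # Keep backbone and action head; drop the world-model branch.
        return ToyPolicy(self.backbone, self.action_head, None)
```

In this toy, `for_inference()` returns a policy that produces the same actions as the trained one, because the auxiliary branch never sits on the action path.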
The demo here was fine-tuned on only 13 examples, which shows strong pretraining, and it runs in real time on @NVIDIARobotics DGX Spark!
VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀
@Thom_Wolf @ClementDelangue
Well, I still don’t get the point: why does everything seem to revolve around the final destination, which is video generation?
I mean, if the goal is to make something human-like, then representation alone should be enough, and representation should be the real goal. Humans do not “generate” videos. We act based on our internal representations of the world.
@AdaFang_ great work!
With this I can train the next-gen JEPA-based model and post the results just by giving it a simple task description.
Thank you and your team