Excited to share our latest work: One-Forcing!
We compressed standard 4-step autoregressive video generation into just 1-STEP, achieving a 4x theoretical speedup for real-time generation!
The crazy part? Our 1-step model outperforms strong 4-step baselines on VBench! 🔥
Paper: https://t.co/Saza8LMuAx
Project: https://t.co/NasGwqo9wC
⭐ Code: https://t.co/p4OWjg39wh
@qinzytech The gap between ICL and BP is too large. The former is heavy and the latter is light. For humans, it's a middle state. How do we design that in AI?
I wonder about the relationship between learning efficiency and generalization. In my observation, recent methods can't do both well.
Learning too fast may mean overfitting. But in real intelligence they come together.
So JEPA/LeWorldModel and other representation-based methods claim to have few-shot learning ability. But are we trapped in the clean world? I agree with compositional generalization. But that's not how LLMs work.
Aristotle and Plato represent two different paradigms, bottom-up and top-down respectively. What will the answer be for robotics?
VLA-JEPA just dropped in LeRobot
What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics.
During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos.
At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head.
The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on @NVIDIARobotics DGX Spark!
VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last
@Thom_Wolf @ClementDelangue
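A rough way to picture the train/inference split described in the VLA-JEPA post: a world-model branch adds an auxiliary loss during training and is simply not called at inference. The sketch below is a toy under loud assumptions. `ToyVLA`, `training_loss`, `wm_weight`, and the tiny linear layers are hypothetical names for illustration, not the LeRobot or VLA-JEPA API. The target latent is assumed to come from a frozen video encoder (V-JEPA2 in the post), and the exact way the VLA conditions the JEPA predictor is not specified here.

```python
import random

def linear(weights, x):
    """Apply a dense layer stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

class ToyVLA:
    """Hypothetical stand-in: a backbone, an action head, and a training-only predictor."""

    def __init__(self, obs_dim, latent_dim, action_dim, seed=0):
        rng = random.Random(seed)
        rand = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                                   for _ in range(rows)]
        self.backbone = rand(latent_dim, obs_dim)        # stands in for the VLM backbone
        self.action_head = rand(action_dim, latent_dim)  # kept at inference
        self.predictor = rand(latent_dim, latent_dim)    # world-model branch, training only

    def act(self, obs):
        """Inference path: backbone and action head only, no world model."""
        return linear(self.action_head, linear(self.backbone, obs))

    def training_loss(self, obs, expert_action, target_next_latent, wm_weight=1.0):
        """Action imitation loss plus an auxiliary latent-prediction loss.

        target_next_latent is assumed to be produced by a frozen encoder
        applied to a future frame; here it is just a vector.
        """
        h = linear(self.backbone, obs)
        action_loss = mse(linear(self.action_head, h), expert_action)
        wm_loss = mse(linear(self.predictor, h), target_next_latent)
        return action_loss + wm_weight * wm_loss

model = ToyVLA(obs_dim=4, latent_dim=3, action_dim=2)
obs = [0.1, 0.2, 0.3, 0.4]
loss = model.training_loss(obs, expert_action=[0.0, 0.0],
                           target_next_latent=[0.0, 0.0, 0.0])
action = model.act(obs)  # the predictor is never touched on this path
```

Because the auxiliary loss only needs observations and a target latent, not action labels, the same structure hints at why action-free human video could be used for pretraining, as the post claims.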
@andrew_n_carr The most informative thing is our world. But it's definitely the ONLY probability. So the key is to interact with and learn from the world. Humans are an adaptation of nature.
@peterpaohuang @Stanford_AI_Bio This reminds me of the deep theory of gauge choice in physics. Maybe we could build more complex math structures than vector fields.
@vincesitzmann For AR we use embeddings; for diffusion we use encoders/decoders. Yet for hybrid AR-diffusion models like recent world models, we know too little about what makes a good encoder.