[1/6] Ego-centric World Models
We introduce EgoWM — a video world model that simulates EVE-1X humanoid interactions from a single ego-view image + full-body joint angle trajectories.
Moreover, it effortlessly generalizes to extreme OOD domains, including paintings!
Excited to share our project - Sim2Reason!
Key Insight: Simulators are an untapped source of cheap supervision for scientific reasoning. LLMs can learn physical reasoning from simulation to improve on real-world benchmarks such as the International Physics Olympiad!
In my recent blog post, I argue that "vision" is only well-defined as part of perception-action loops, and that the conventional view of computer vision - mapping imagery to intermediate representations (3D, flow, segmentation...) - is about to go away.
https://t.co/aFmE9CHHau
@zhihelu1 Thanks! For world models, or more precisely forward dynamics models (current state + action -> future state), this is the standard formulation. There are lots of model-based control approaches that can plan/predict actions using such world models.
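To make the formulation concrete, here is a minimal sketch of planning with a forward dynamics model. Everything in it is an assumption for illustration: `forward_dynamics` is a hand-written 1D point mass standing in for a learned world model, and `plan_random_shooting` is plain random-shooting search, one of the simplest model-based control schemes, not the method of any project mentioned here.

```python
import random


def forward_dynamics(state, action, dt=0.1):
    """Toy forward dynamics model: (current state, action) -> future state.

    state is (position, velocity); action is an acceleration.
    Hypothetical stand-in for a learned world model.
    """
    pos, vel = state
    vel = vel + action * dt
    pos = pos + vel * dt
    return (pos, vel)


def rollout_cost(state, actions, goal):
    """Roll the model forward through an action sequence and score the end state."""
    for action in actions:
        state = forward_dynamics(state, action)
    pos, vel = state
    # Penalize distance to the goal and, lightly, arriving with leftover velocity.
    return (pos - goal) ** 2 + 0.1 * vel ** 2


def plan_random_shooting(state, goal, horizon=10, n_candidates=256, seed=0):
    """Sample random action sequences and keep the one the model predicts is best."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, actions, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    plan, cost = plan_random_shooting(state=(0.0, 0.0), goal=1.0)
    print("first planned action:", plan[0], "predicted cost:", cost)
```

Executing only the first planned action and then re-planning from the new state turns this into a basic model-predictive control loop.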
[5/6] Temporal compression
Unlike prior work, we preserve full-sequence diffusion and compress actions to the latent temporal resolution. EgoWM achieves +42% better action alignment at a +4s horizon vs. frame-wise autoregressive NWMs, even with 4× temporal compression (Cosmos-2B).
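A rough sketch of what "compressing actions to the latent temporal resolution" could look like. The post does not say how EgoWM does this, so the scheme below is an assumption: each group of 4 consecutive per-frame action vectors is concatenated into one vector per latent step. The name `compress_actions` and the concatenation choice are hypothetical.

```python
def compress_actions(actions, factor=4):
    """Group per-frame action vectors into one vector per latent time step.

    actions: list of T per-frame vectors (e.g. joint angles), each of length D.
    Returns T // factor vectors of length factor * D.
    Concatenation is an assumed scheme, not necessarily what EgoWM uses.
    """
    if len(actions) % factor != 0:
        raise ValueError("number of frames must be divisible by the compression factor")
    compressed = []
    for start in range(0, len(actions), factor):
        chunk = actions[start:start + factor]
        # Flatten the chunk so one latent step sees all of its frames' actions.
        compressed.append([value for frame in chunk for value in frame])
    return compressed


if __name__ == "__main__":
    # 8 frames of 2-dimensional actions -> 2 latent steps of 8 values each.
    per_frame = [[float(t), t + 0.5] for t in range(8)]
    latent_actions = compress_actions(per_frame, factor=4)
    print(len(latent_actions), len(latent_actions[0]))
```

Concatenating rather than averaging keeps every per-frame action visible to the model, at the cost of a wider conditioning vector per latent step.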
My role at Meta's SAM team (MSL, previously at FAIR Perception) has been impacted by the layoffs, within 3 months of my joining after my PhD.
If you work with multimodal LLMs for grounding or complex reasoning, or have a long-term vision of unified understanding and generation, let's talk.
I am on the job market starting immediately.
#metalayoffs #FAIR #MSL #SAM
[ICCV 25] Refer Everything Model (REM)
(1/6) We leverage Text-to-Video Generation models to zero-shot segment any concept in a video using text. REM generalises to dynamic concepts like smoke, light beams and more without ever having seen segmentation masks for these entities.
(6/6) We’re at the start of the internet-scale "video" era, and the possibilities are exciting. Learn more at https://t.co/ERyk8Cfpst — our code & model weights are available. Visiting ICCV? Come see our poster on Oct 23 to chat and see results in action!
(5/6) REM demonstrates how Text-to-Video generation can serve as a powerful pre-training paradigm for downstream video understanding. The days of large-scale, labor-intensive video annotation may soon be behind us — pre-train to generate, fine-tune lightly to understand.