[1/6] Ego-centric World Models
We introduce EgoWM, a video world model that simulates EVE-1X humanoid interactions from a single ego-view image + full-body joint-angle trajectories.
Moreover, it generalizes to extreme out-of-distribution (OOD) domains, including paintings!
[ICCV 25] Refer Everything Model (REM)
(1/6) We leverage text-to-video generation models to segment any concept in a video zero-shot, using only a text prompt. REM generalises to dynamic concepts such as smoke and light beams without ever having seen segmentation masks for these entities.
Check out our latest work, Learning to Track with Object Permanence: https://t.co/O2OHBTOU6J We use synthetic data from @parallel_domain, which provides dense ground-truth annotations, including for invisible objects, to learn to track behind occlusions.
Ablation analysis demonstrates the importance of dataset size for learning this challenging behaviour. Collecting such a dataset in the real world would be extremely expensive, further emphasizing the value of synthetic data in ML research. @ljlijie, @adnothing, @ToyotaResearch
Our method is fully online, runs in real-time, and is end-to-end trainable. The resulting model is transferred to real videos with a simple, data-efficient approach, and outperforms the state-of-the-art on KITTI and MOT17 benchmarks by significant margins.