Introducing LaRI (https://t.co/4dfmwXKZHw), a📸single-view,🚀single-feed-forward method to model🙈unseen 3D geometry using layered point maps. It
✅seamlessly extends depth estimation
✅unifies object- & scene-level reasoning
✅builds training & eval datasets
Details👇
Zang et al., "World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible"
A Diffusion Transformer that estimates multiple layers of depth to further estimate occluded parts as well.
Natural images often already implicitly contain depth information — hidden in bokeh effects.
Can we leverage the rich depth cue widely exist in natural images for depth estimation?
We explore this in our recent project BokehDepth (ICML 2026).
- Stage 1: A generative model produces calibrated bokeh stacks from the input image.
- Stage 2: The bokeh stacks are integrated into a depth prediction model to estimate depth.
We believe it highlights bokeh effects as an important and effective complementary cue for monocular depth estimation.
🌐 Project page: https://t.co/vcfTSHFZn6
📄 arXiv: https://t.co/gqGgesXIrO
👨💻 Code: https://t.co/KNs46UGXVu
Transformers have succeeded in modeling phenomena traditionally associated with computer graphics, such as 3D visual effects (e.g., RayZer) and rendering processes (e.g., RenderFormer).
A natural question is whether they can also tackle the challenging task of cloth simulation.
We introduce 👕𝗖𝗹𝗼𝘁𝗵𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿, a Transformer-based method that reformulates cloth simulation as autoregressive next-state prediction in a learned latent space.
It handles diverse scenarios under a single model, with 4-9x lower error than prior SOTAs:
• Body-driven garments
• Robotic manipulation
• General cloth–object collisions
We believe it highlights the potential of Transformer-based autoregressive models as a powerful alternative to conventional simulation approaches.
This work is mainly led by my student Yu Zhang @yucrazing
🌐 Project page: https://t.co/50cCU6i7GT
📄 arXiv: https://t.co/z6yBNiANpy
The current paper submission and review process seems unlikely to survive LLMs. One alternative would be to build a new process around talks: "submission" is making and giving a 30 minute live talk, and "review" is three experts watching, evaluating, and asking questions.
🎺Meet VIST3A — Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator.
➡️ Paper: https://t.co/sFqbbUiGOO
➡️ Website: https://t.co/QWMLwXyVcB
Collaboration between ETH & Google with Hyojun Go, @DNarnhofer, Goutam Bhat, @fedassa, and Konrad Schindler.
🚀Excited to share our recent work on test-time scaling for feed-forward Gaussian splatting:
we learn a recurrent model ReSplat that is able to iteratively improve the reconstruction quality in a feed-forward manner!
https://t.co/OB38xC7PjC
Interesting ICLR submissions 🤩
Depth Anything 3 - My TLDR: Init multi view transformer of VGGT with later layer DINO weights and use teacher model trained on synthetic data only for pseudo labelling real world datasets
https://t.co/tPXAqH61B8
Trace Anything - My TLDR: VGGT like model predicting N view geometry and motion as a trajectory field represented using splines and control points
https://t.co/OUcLJeMgjN
The field is evolving very fast!
🚀 The #ICCV2025 Award Candidate Papers are out! 🚀
From 2,701 submissions, only 13 were selected, spanning 3D vision, generative models, foundation models, and more.
Key highlights at a glance 👇
(12/13) Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
TL; DR: A ground-truth-free method (PCR) that evaluates object detectors via prediction consistency and confidence reliability.
📃Paper: https://t.co/3oftIkN1CZ