Video2Humanoid. Still trouble with bad retargeted humanoid motions?
Humanoids now are Easy to track any video using our new Neural Motion Retargeting (NMR) Method.
Optimization-based motion retargeting methods like IK and GMR are solving a non-convex problem frame by frame, which makes them sensitive to initialization, hard to tune, and prone to poor local minima. The result is familiar: joint discontinuities, self-collisions, and unstable foot contact.
Our solution is simple but effective : instead of optimizing each frame independently, learn the mapping between human motion and robot motion as a distribution via networks.
📊Results on Unitree G1
- 0 joint jumps
- 54% fewer self-collision frames
- 61% fewer joint limit violations
- Faster convergence for downstream control policy training
NMR turns motion retargeting from a fragile optimization problem into a learned, scalable pipeline for more stable humanoid motion.
please visit https://t.co/mtyhgnT56Q for more visualization results.
opensource: https://t.co/H01K7mFMQT
Tencent presents Yan: Foundational Interactive Video Generation.
It has been only two months since our release of Self-Forcing, and there are already two world foundation models built on top of it.
Chinese teams are building at the speed of light!
https://t.co/HkmxPiiDvb
📣Matrix-3D: Omnidirectional Explorable 3D World Generation📣
Matrix-3D is a 3D world model which generates large-scale explorable 3D scenes from a single image or text prompt, with SOTA performance.
- Project: https://t.co/Be94fgBtEs
- Code: https://t.co/fdNNyS6jX6
🔎Excited to share our PDT: Point Distribution Transformation with Diffusion Models (SIGGRAPH2025)!
💡While autoregressive models are now widely adopted for predicting structures, we find diffusion models can also reveal high-level structures by transforming point distributions!
Special submission experience.
A co-authored ICLR submitted paper was rejected with all positve final reviews. Reviewers agreed main concernings were solved and raised scores after rebuttal, but AC still rejected it.
Excited to share our work DrivingWorld, a GPT-style Autoregressive (AR) video world model for autonomous driving.
Rather than simply employ the next-token-prediction strategy, we leverage next-frame-prediction to model temporally coherent frame-wise information and then use next-token-prediction to interpolate the content of each frame. As a result, our model achieves accurate future prediction (up to 40 seconds) based on the inputted vehicle's locations (left subfig of the video) and a few conditioned images (non-red contourred frames at the beginning).
Codes and more demos can be found at https://t.co/cnvMVoXi2M
EnvGS: Modeling View-Dependent Appearance with Environment Gaussian
Contributions:
• We propose a novel scene representation for accurately modeling complex near-field and high-frequency reflections in real-world environments.
• We developed a real-time ray-tracing renderer for 2DGS, enabling joint optimization of our representation for accurate scene reconstruction while achieving real-time rendering speeds.
• Extensive experiments show that EnvGS significantly outperforms previous methods. To the best of our knowledge, EnvGS is the first method to achieve real-time photorealistic specular reflections synthesis in real-world scenes.
Very smart idea! Congrats to @Lyxun2000 . Rather than heavy 3D triplane transformer arch, a pure 2D diffusion-based model for generative gaussian splatting. High-quality and efficient!
LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images
Contributions:
• We train a network to map images to a novel Relative Coordinate Map (RCM), which embeds pixel-wise correspondences across different input views to a main view and can be converted to its point cloud representation.
• We demonstrate that RCMs can be easily obtained by fine-tuning a pre-trained 2D network to capitalize on existing 2D foundation models.
• We showcase the superior quality of our flexible method, enabling rapid 3D generation
from mobile phone image captures within seconds.
LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images
Contributions:
• We train a network to map images to a novel Relative Coordinate Map (RCM), which embeds pixel-wise correspondences across different input views to a main view and can be converted to its point cloud representation.
• We demonstrate that RCMs can be easily obtained by fine-tuning a pre-trained 2D network to capitalize on existing 2D foundation models.
• We showcase the superior quality of our flexible method, enabling rapid 3D generation
from mobile phone image captures within seconds.
Surf-D will be posted today. Please check out with our handsome boy @frankzydou if you're at ECCV 2024 @eccvconf!!
👗 Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
📅 Wed, Oct 2 | 16:30 - 18:30 | Poster Session 4 | Poster 285
Lotus
Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution to enhance zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systemic analysis of the diffusion formulation for the dense prediction, focusing on both quality and efficiency. And we find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves SoTA performance in zero-shot depth and normal estimation across various datasets. It also significantly enhances efficiency, being hundreds of times faster than most existing diffusion-based methods.
If you're at ECCV 2024 @eccvconf, please check out our recent works on shape and motion generation.
3D Shape Generation with Arbitrary Topologies
👗 Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
📅 Wed, Oct 2 | 16:30 - 18:30 | Poster Session 4 | Poster 285
paper: https://t.co/ZxQeBtFerM
Efficient Human Motion Generation
🏃🏻♀️ EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation
📅 Wed, Oct 2 | 10:30 - 12:30 | Poster Session 3 | Poster 227
paper: https://t.co/u1s71rlXRb
Controllable Human Motion Synthesis
🤸🏼♂️ TLControl: Trajectory and Language Control for Human Motion Synthesis
📅 Tue, Oct 1 | 16:30 - 18:30 | Poster Session 2 | Poster 273
paper: https://t.co/HYE5gZt53h
Avatar Generation
💃 Disentangled Clothed Avatar Generation from Text Descriptions
📅 Fri, Oct 4 | 10:30 - 12:30 | Poster Session 7 | Poster 301
paper: https://t.co/lVzu4ZCfHZ
- Please feel free to ask any questions. I am excited to explore and discuss all the interesting ideas during the poster session.
I will be attending the Doctoral Consortium on Oct 1 (12:00 - 14:00). Looking forward to it!
🗺️Dynamic Realms: 4D Content Analysis, Recovery and Generation with Geometric📐, Topological🔗and Physical Priors⚛️
#ECCV2024 #Milan #Milano #ECCV #Vision #Graphics #GENAI #generativeai
Congrats to Linhan for his first NeurIPS paper.
DC-Gaussian could reconstruct high-quality 3D Gaussain Splatting via shitty cameras shooting through windshield (car dash cameras)😀.
Codes have been released.
See project page here: https://t.co/F6DMTOAg6q
Thanks for the tweet.
We improve 3D Gaussain splatting for dash cam videos that contain serious reflections and obstructions.
With Dash cam videos, it's possible to leverage massive driving data.
Visit our page at https://t.co/F6DMTOzIgS
Codes will be soon.
An image + a detailed normal map generated by GeoWizard --> a relightable image.
Geowizard is a generative geometry estimation model that could jointly produce high-quality depth and normal with intricate details.
Try our huggingface demo here: https://t.co/cNVburSygp
#huggingface #gradio #geowizard
🔥Exciting update to the HuggingFace demo of Metric3D, the most powerful monocular metric depth predictor😱! New feature: directly view reconstructed 3D scenes alongside depth maps on the web page. 😎Check out these enhance at https://t.co/WB9xXvkIqr