📽️ Check out Visual Odometry Transformer! VoT is an end-to-end model for getting accurate metric camera poses from monocular videos.
https://t.co/6tVXVt6mTx
VoT requires neither camera calibration parameters nor post-optimization, and it runs in real time, processing thousands of frames. It is trained on a vast amount of real-world indoor data, yet works just fine in outdoor scenarios. It uses only camera poses as supervision - no optical flow, intrinsics, point clouds, or tracks - making it broadly accessible.
We experimented with different backbones, camera pose representations, scalability, and attention mechanisms. Our evaluation spans hundreds of full-length videos across various metrics, without aligning the predicted trajectory to the ground truth, to simulate a real-world application.
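To make "without aligning the predicted trajectory to the ground truth" concrete, here is a minimal sketch of an unaligned absolute trajectory error. The function name `ate_rmse_unaligned` and the list-of-positions input format are assumptions for illustration, not VoT's actual evaluation code. The point is that no scale or rigid alignment step runs before the error is measured, so any scale drift in the prediction shows up directly in the number.

```python
import math


def ate_rmse_unaligned(pred_positions, gt_positions):
    """RMSE of per-frame position error, with no alignment of any kind.

    Hypothetical helper: both inputs are sequences of (x, y, z) positions
    in the same metric world frame, one entry per frame.
    """
    if len(pred_positions) != len(gt_positions):
        raise ValueError("trajectories must have the same number of frames")
    if not pred_positions:
        raise ValueError("trajectories must not be empty")

    # Squared Euclidean distance between predicted and true position per frame.
    squared_errors = [
        sum((p - g) ** 2 for p, g in zip(pred, gt))
        for pred, gt in zip(pred_positions, gt_positions)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# A prediction with the right shape but twice the scale is penalized,
# because nothing rescales it before comparison.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
error = ate_rmse_unaligned(pred, gt)
```

A metric computed after a similarity alignment would report zero error for the scaled trajectory above, which is why the unaligned variant is the stricter test for a metric-pose model.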
Thanks to the team, @kienduynguyen94, @theogevers, @cgmsnoek, and @Martin_R_Oswald from the @UvA_Amsterdam!
@bercankilic Difficult. Lots of dynamics happening in the view - this is quite different from the data it was trained on. However, if tuned on egocentric dynamic data, I'm pretty confident it would work
@gabriberton @changh95 @AjdDavison Exactly. Another aspect is that they were all relatively simple indoor scenes, and we had input depth maps for extra verification
@changh95 @AjdDavison For me, even small DINO features worked better than dbovw for loop closure detection, and there's quite some stuff on this from @gabriberton
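For readers unfamiliar with descriptor-based loop closure detection, a rough sketch of the idea: each frame gets one global feature vector, and a loop candidate is an earlier frame whose vector is very similar to the current one. Everything here is an assumption for illustration (the function names, the `threshold` and `min_gap` parameters, and plain Python lists standing in for real image features); it is not the setup used in the reply above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_loop_candidates(descriptors, query_index, threshold=0.9, min_gap=30):
    """Return (frame_index, similarity) pairs for likely loop closures.

    Hypothetical helper: `descriptors` holds one global feature vector per
    frame. Frames closer than `min_gap` to the query are skipped, since
    temporally adjacent frames look alike without being a revisit.
    """
    query = descriptors[query_index]
    candidates = []
    for index in range(max(0, query_index - min_gap + 1)):
        similarity = cosine_similarity(query, descriptors[index])
        if similarity >= threshold:
            candidates.append((index, similarity))
    # Most similar frames first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


# Toy descriptors: frame 3 looks like frame 0 again.
toy = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.01]]
matches = find_loop_candidates(toy, query_index=3, threshold=0.9, min_gap=2)
```

In a real pipeline the candidates would still need geometric verification before being accepted as loop closures.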
@gabriberton It's a very strong claim. It may be fine for robots, but for in-the-wild reconstruction (e.g. random phone videos), SLAM (both dense and sparse) is very, very far from being solved. It becomes even more obvious once you move a bit further away from academic datasets
@maikelborys No. We think it will be easier for a person who knows ROS to adapt it to ROS than for a person who doesn't know ROS to adapt it to pure Python :)
⏩Code release for MAGiC-SLAM!
https://t.co/GHUbY2s54U
We vibe-coded hard to make the code as simple as possible. Here are some features you can seamlessly integrate into your 3D reconstruction pipeline right away:
Introducing “Gaussian Mapping of Evolving Scenes”! We present an RGBD mapping system with novel view synthesis capabilities that accurately reconstructs scenes that change over time.
https://t.co/9xz7zvf0xx
One of our CVPR highlights 👉 Meet MAGiC-SLAM: multi-agent SLAM powered by rigidly deformable 3D Gaussians for novel view synthesis. New tracking, map-merge & loop-closure kill drift, align maps, and run faster + more accurately than 2-agent baselines on synthetic & real data. @vyuga3d @theogevers @Martin_R_Oswald #CVPR2025
This work was conducted in collaboration with Kersten Thies, @lucacarlone1, @theogevers, @martinoswald, and Lukas Schmid at the Computer Vision Group of the @UvA_Amsterdam and @MIT Spark Lab.