The biggest impact of the Segment Anything line of work was not the actual image segmentation, but rather the flood of paper titles with the word “Any” in them. C’mon folks, let’s just call this generalization and move on!!
@MattNiessner I agree that metric 3D is critical! It’s a compressed, minimal representation. Playing devil’s advocate: humans also operate on projections of the 3D world, and we manage pretty well
How do you decompose a 2D image into accurate 3D object detections? You use 🥊 Boxer.
A new model from Reality Labs Research enables robust 3D object detection by "lifting" 2D proposals from off-the-shelf detectors like OWL-ViT and SAM into metric 3D space.
No more "flat" AI—this is about spatial intelligence for the next generation of wearables.
Blog🔗 https://t.co/WdOgFPzBBI
Website with links to download: https://t.co/HUdive9EYx
👉@ddetone
@Capsbrr Ah sorry, I thought you meant on Quest cameras, not running the ML model on Quest hardware. I don’t think this model can run in real time on Quest, though with further effort it could probably be distilled significantly and might then work
Today we release Boxer, a new lightweight approach that lifts open-world 2D bounding boxes to *metric* 3D: https://t.co/5IZ0tPlqvr
Here we show Boxer in action on an egocentric sequence captured from smart glasses:
I implemented it and the ~8 degree gravity correction from GeoCalib made a real difference.
Look at the monitor - on the left (pose heuristic) the box is tilted and doesn't match the screen edges, on the right (GeoCalib) it wraps the monitor much more tightly. The shelf boxes at the top are also cleaner, less overshoot.
Yeah, the improvement is clear.
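For anyone wondering what an ~8 degree gravity correction means geometrically, here is a minimal sketch. It is not GeoCalib's API or Boxer's code. It assumes a hypothetical world frame where "down" is -Z, and a made-up gravity estimate `g_est` tilted 8 degrees away from it. It measures the tilt, then builds the rotation (Rodrigues formula) that maps the estimated gravity back onto world down.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def angle_between_deg(a, b):
    """Angle in degrees between two 3D directions."""
    a, b = normalize(a), normalize(b)
    c = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(c))

def rotation_aligning(src, dst):
    """3x3 rotation matrix that rotates direction src onto direction dst."""
    a, b = normalize(src), normalize(dst)
    c = sum(x * y for x, y in zip(a, b))
    if c < -1.0 + 1e-9:
        # Antiparallel case: rotate 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - (1.0 if i == j else 0.0) for j in range(3)]
                for i in range(3)]
    v = cross(a, b)
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = 1.0 / (1.0 + c)
    # Rodrigues formula written in terms of the cross-product matrix k.
    return [[(1.0 if i == j else 0.0) + k[i][j] + f * k2[i][j] for j in range(3)]
            for i in range(3)]

def apply(R, p):
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

# Assumption: world "down" is -Z, and the estimate is tilted 8 degrees about Y.
WORLD_DOWN = [0.0, 0.0, -1.0]
tilt = math.radians(8.0)
g_est = [math.sin(tilt), 0.0, -math.cos(tilt)]

tilt_deg = angle_between_deg(g_est, WORLD_DOWN)
R_fix = rotation_aligning(g_est, WORLD_DOWN)
g_aligned = apply(R_fix, g_est)
```

Applying the same `R_fix` to the scene (or to box corners) is what levels a box that would otherwise look tilted relative to the monitor edges.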
@_satyam_ai the gravity estimate looks a little bit off. another idea could be to run this per frame and take the global 3D average: https://t.co/KwRqnfa1F8
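One possible reading of the "global 3D average" idea, sketched under assumptions: each frame gives a gravity direction in its own camera frame, plus a camera-to-world rotation from tracking. Rotate every estimate into the world frame, sum, and renormalize. The function name and inputs are hypothetical and not taken from the linked code.

```python
import math

def mat_vec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    if n < 1e-12:
        raise ValueError("zero-length vector")
    return [x / n for x in v]

def average_gravity_world(rotations_world_from_cam, gravities_cam):
    """Average per-frame gravity directions after rotating them into the world frame."""
    total = [0.0, 0.0, 0.0]
    for R, g in zip(rotations_world_from_cam, gravities_cam):
        g_world = mat_vec(R, unit(g))
        total = [t + x for t, x in zip(total, g_world)]
    # Renormalize the sum; this fails loudly if the estimates cancel out.
    return unit(total)

# Toy data (assumption): identity camera poses, two estimates tilted +5 and -5 degrees.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = math.radians(5.0)
per_frame = [[math.sin(t), 0.0, -math.cos(t)], [-math.sin(t), 0.0, -math.cos(t)]]
g_avg = average_gravity_world([IDENTITY, IDENTITY], per_frame)
```

Averaging unit vectors like this is reasonable when the per-frame errors are small and roughly symmetric. A median or outlier rejection step would be more robust if a few frames are badly off.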
@yesitsarmin yes, the main limitation is the 2D detector here, but there are tons of better models (SAM3, VLMs) if you have the compute. for very cluttered scenes it doesn't work as well
@BlueAquilae great question! I would not expect it to work well here, we would need to re-train it with a full 9 DoF representation. but feel free to try it out anyway, I'd be curious
@haodongli00 One limitation I found using both of those models is the runtime. For detecting 1000+ text prompts with SAM3 it takes 20+ sec per image. SAM3D also takes ~15 sec per object, so running on large datasets can be expensive. OWLv2 runs at ~30ms and Boxer takes ~20ms
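To put those runtimes in perspective, here is a back-of-the-envelope sketch using only the per-image numbers quoted above. The dataset size of 100,000 images is a made-up example. The "20+ sec" figure is treated as a lower bound, and SAM3D's ~15 sec per object is left out because the number of objects per image isn't given.

```python
def dataset_hours(num_images, seconds_per_image):
    """Total wall-clock hours for sequential, single-device processing."""
    return num_images * seconds_per_image / 3600.0

# Per-image timings quoted in the post above.
SAM3_SEC_PER_IMAGE = 20.0    # "20+ sec per image" with 1000+ text prompts (lower bound)
OWLV2_SEC_PER_IMAGE = 0.030  # "~30ms"
BOXER_SEC_PER_IMAGE = 0.020  # "~20ms"

NUM_IMAGES = 100_000  # hypothetical dataset size, for illustration only

sam3_hours = dataset_hours(NUM_IMAGES, SAM3_SEC_PER_IMAGE)
owl_boxer_hours = dataset_hours(NUM_IMAGES, OWLV2_SEC_PER_IMAGE + BOXER_SEC_PER_IMAGE)
speedup = sam3_hours / owl_boxer_hours

print(f"SAM3 alone:    {sam3_hours:.1f} h")
print(f"OWLv2 + Boxer: {owl_boxer_hours:.2f} h")
print(f"ratio:         {speedup:.0f}x")
```

At these rates the gap is roughly weeks versus about an hour and a half on one device, which is why the per-image cost dominates at dataset scale.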