VideoNet will appear as a @CVPR Highlight✨ + at 3 workshops TODAY!
Multimodal AI is improving fast, but can it tell apart moves only a domain expert could name?✒️🪀
You probably got the hang of it from the clips below😉 Can models do the same with few-shot examples?
Come find out 👇
🔗 Website: https://t.co/Y9AOF3Rid0
📍 Poster: Fri, Jun 5, 4:00–6:00 PM
🗓️ Workshops (all today, Jun 4):
- KnowledgeMR — 🏆Best Paper Award candidate, talk by @tanushyy
- CVSports
- VidLLMs
Remember action recognition? The days of trying to climb on Kinetics?👻
Announcing VideoNet, a CVPR 2026 Highlight 🎉 which revitalizes action recognition in the VLM era
Explore our data with this fun, interactive demo: https://t.co/W53aBi3QAX
(1/8) 🧵
Thrilled to announce our latest project at @allen_ai @RAIVNLab: WildDet3D
Humans understand objects in 3D effortlessly -- we see a mug on a desk, judge the distance to a parked car, or estimate the height of a building across the street. For CV / Robotics models, this remains surprisingly hard.
We've built great models that each handle a piece of the puzzle: FoundationPose for 6-DoF pose over tabletops, MoGe 2 for accurate metric depth estimation, SAM for 2D segmentation and tracking. But they're fragmented -- each solves one sub-task, none gives you the full picture: where is this object in 3D, how big is it, and how is it oriented?
Monocular 3D object detection is exactly this task -- recovering the full 3D bounding box of any object from a single RGB image. It's the missing link that connects 2D perception to real-world 3D understanding for robotics, AR/VR, and embodied AI.
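To make "where is it, how big is it, how is it oriented" concrete, here is a minimal sketch of one common way to write down a 3D box and project it into an image. Everything in it is an assumption for illustration: the `Box3D` class, the `project` helper, the yaw-only rotation (a full orientation has three degrees of freedom), and the pinhole intrinsics are hypothetical and are not WildDet3D's actual output format.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3D box: position, size, and a simplified orientation."""
    center: tuple  # (x, y, z) in metres, camera coordinates (z points forward)
    size: tuple    # (width, height, length) in metres
    yaw: float     # rotation about the vertical (y) axis, in radians

    def corners(self):
        """Return the 8 box corners as (x, y, z) tuples in camera coordinates."""
        cx, cy, cz = self.center
        w, h, l = self.size
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        out = []
        for dx in (-w / 2, w / 2):
            for dy in (-h / 2, h / 2):
                for dz in (-l / 2, l / 2):
                    # Rotate the corner offset about the y axis, then translate.
                    rx = c * dx + s * dz
                    rz = -s * dx + c * dz
                    out.append((cx + rx, cy + dy, cz + rz))
        return out


def project(point, fx, fy, px, py):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + px, fy * y / z + py)


# A 2 m x 1 m x 4 m box centred 5 m in front of the camera, unrotated.
box = Box3D(center=(0.0, 0.0, 5.0), size=(2.0, 1.0, 4.0), yaw=0.0)
pixels = [project(p, fx=500.0, fy=500.0, px=320.0, py=240.0) for p in box.corners()]
```

A monocular detector has to recover all of these numbers from the pixels alone, which is why the metric depth of the center is the hard part.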
So why hasn't anyone cracked open-world 3D detection? Data.
Existing 3D datasets (Omni3D, COCO3D) cover fewer than 100 categories, locked to driving corridors and indoor rooms. And the annotation methods -- BEV labelling, point cloud labelling -- fundamentally don't scale to in-the-wild scenes where you don't have LiDAR or a well-reconstructed point cloud. On top of that, in-the-wild objects vary far more in size and pose than vehicles and furniture.
To change this, we designed a human-in-the-loop pipeline. We built a set of pseudo-3D box generators using different algorithms/models to propose candidate boxes. Then 1700+ human annotators from Prolific selected the best candidate and verified quality.
After several months of work with our annotators, the result is WildDet3D-Data -- 1M total images and 13.5K object categories, including 100K images with human-verified 3D detections. That's 138x more category coverage than Omni3D. Street food carts, violins, traffic cones, sculptures -- objects no 3D dataset has ever covered.
With this data, we trained WildDet3D -- a single geometry-aware architecture built on SAM 3 and LingBot-Depth that unifies every way you'd want to interact with a 3D detector:
- Text: "find all chairs"
- Box prompt: click a 2D box, get its 3D box (geometric, one-to-one)
- Exemplar prompt: draw one box, find all similar objects (one-to-many)
- Point prompt: click on an object
And when you have extra depth -- LiDAR, stereo, anything -- just pass it in. The model fuses it and gets substantially better: +20.7 AP on average. No depth? It works fine without it.
Results on our new in-the-wild benchmark (WildDet3D-Bench, 700+ open-world categories): 22.6 AP text / 24.8 AP box -- up from 2.3 AP for the previous best. With depth: 41.6 AP text / 47.2 AP box. Also SOTA on Omni3D (34.2 AP text / 36.4 AP box) with 10x fewer training epochs, and strong zero-shot transfer to Argoverse 2 and ScanNet (40.3 / 48.9 ODS).
Today, we release SERA-32B, an approach to coding agents that matches Devstral 2 at just $9,000. It is fully open-source and you can train your own model easily - at 26x the efficiency of using RL.
Paper: https://t.co/aeD6T2WW3O
Here’s how 🧵
Total TDS/INT
Michael Penix Jr. 33/7
Stetson Bennett 26/6
Total YDS
Penix Jr. 4440
Bennett 3609
Total YDS/Game
Penix Jr. 370
Bennett 278
Passer Rating
Penix Jr. 155.5
Bennett 154.2
Adjusted Yards/Attempt
Penix Jr. 9.2
Bennett 9.0
If you’re unfamiliar, Chargers Pro Bowl OG Kris Dielman tried flying home from NY after suffering a concussion. He had a seizure, they had to do an emergency landing, and he never played again. They’re lucky he lived.