๐ข [CVPRโ26] Can we learn to detect, segment, and track every object in a video without human supervision?ย
Yes, we introduce VideoCUPS, the first unsupervised video panoptic segmentation (VPS) method: 1. Get pseudo-labels from monocular videos. 2. Train a VPS model on them.
When fine-tuned with just 10% of labels, VideoCUPS already matches a fully supervised model trained on all Cityscapes-VPS labels, and outperforms the DINO-initialized baseline significantly.
[1/3] Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities
by @dustin_carrion*, Maria Santos-Villafranca*, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, and @schaub_simone
[2/3] KARMMA is a multimodal-to-multimodal distillation framework for egocentric action recognition that does not require modality-aligned data and supports any subset of modalities at inference. It produces a lightweight student robust to missing modalities without retraining.
In-context learning suggests that a model has learned versatile representations. What if we use in-context learning itself as a training task for visual representations?
๐ฃ Introducing ๐๐๐๐: ๐๐ถ๐ป๐ฒ๐ฎ๐ฟ ๐๐ป-๐๐ผ๐ป๐๐ฒ๐ ๐ ๐๐ฒ๐ฎ๐ฟ๐ป๐ถ๐ป๐ด โจ @CVPR 2026 Oral โจ
๐๐๐๐ trains on videos without manual annotation.
Key idea: An optimal linear mapping that predicts dense cues (e.g. depth, flow), estimated on one video frame, should also predict the corresponding cues of another frame from the same video.
This yields compelling results on dense vision tasks: video object segmentation, (zero-shot) semantic segmentation and surface normal estimation.
Paper, code, models and demo: https://t.co/Xn2SgskKQ8
Joint work with @ma_sundermeyer, Hidenobu Matsuki, David Joseph Tan and @fedassa (and special thanks to David and Federico for hosting my research visit at Google).
#cvpr2026 @Google@MunichCenterML@tumcvg@TU_Muenchen
โจ#CVPR2026 Oral โจ
A tale of a failed experiment: what if you fine-tune DINOv2 on sparse keypoints, beat every benchmark, only to discover it performs worse than the original frozen model on novel keypoints?
๐MARCO closes this gap: a unified model for generalisable correspondences
https://t.co/vE62YiTVfd
What if a model could learn dense semantic matches from just a handful of annotated landmarks, while still generalizing to unseen keypoints and categories โ and running 10ร faster than diffusion-based approaches?
MARCO is selected as an Oral at #CVPR2026! A unified model for generalizable semantic correspondence, built on DINOv2โญ๏ธ
๐ Try our model: https://t.co/Zvt4QTRVJQ
โจ As a first-year PhD student, I used to wonder what it must feel like to have a paper selected as an Oral at #CVPR. Today, Iโm experiencing that feeling twice!
Iโm beyond happy to share that both of my first-author papers have been selected as #Oral at #CVPR2026 ๐
๐ฅ Can in-context segmentation emerge directly from frozen DINOv3 features?
At #CVPR2026, we present INSID3: Training-Free In-Context Segmentation with DINQv3 โ a collaboration between PoliTo, TU Darmstadt and TU Munich.
A training free approach that generalizes from object-level to part-level and personalized segmentation, across natural, medical, underwater, and aerial domains
Check it out: https://t.co/AaMRbfjLyn