In our CVPR'25 paper, we introduced the ViCaS dataset, which contains 20,000+ videos with both detailed video captions and pixel-precise, phrase-grounded masks for selected objects (1/4)
For those of you looking to extend your Video-LLMs with spatial intelligence capabilities, this dataset is a potential game-changer. ViCaS is the largest human-annotated video dataset that provides both captions and grounded segmentation masks (3/4)
@giffmana Never said it was a perfect solution😅
Although given PyTorch's popularity, the average grad student these days is probably quite familiar with PyTorch API mechanics.
@AljosaOsep @Pandoro_o I think the confusion arose because the deadline was written as 15 Nov, 2AM CT (which is where the conf venue is), and people just assumed it ran until the end of the day Pacific time without thinking much about the timezone.
I'm happy to share our latest work "MaskBit: Embedding-free Image Generation via Bit Tokens", which tackles class-conditional image generation with SOTA results!
➡️Project page: https://t.co/ET7XipcTzN
➡️Preprint: https://t.co/tLdRWeIKdt
📣📣 Hiring a PhD-Intern 📣📣
Work with me on Dynamic 3D Gaussians at the Meta Boston office for 6 months in summer 2025!
Apply here: https://t.co/VnIPQX6f3p
+ write me your questions / link your most relevant work via email or twitter.
Check out our work on fine-tuning of image-conditional diffusion models for depth and normal estimation.
Widely used diffusion models can be improved with single-step inference and task-specific fine-tuning, allowing us to gain better accuracy while being 200x faster!⚡
🧵(1/6)
It's been a pleasure to work on this with @GerardPonsMoll1
I think we found a really effective formulation for training strong, large-scale pose and shape models.
Here are some more qualitative results on tough, in-the-wild YouTube dance videos.
Happy to share that #PointVOS has been accepted to #CVPR2024 🎉 It was a great collaboration with Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe🥳 I will be in Seattle next week to present our poster😊
📜Paper: https://t.co/46rfuNDsms
🌐 Website: https://t.co/4kYBYOJtpm
@eric_brachmann @david_picard @jon_barron @CSProfKGD @dimadamen @taiyasaki I don't think it's obvious to everyone in the by-now-very-large CVPR reviewer pool. This and other common examples would have to be explicitly discussed in the reviewer guidelines for this motion to have the desired effect.
@david_picard @jon_barron @CSProfKGD @dimadamen @eric_brachmann @taiyasaki An obvious gray area is if the previous work releases model checkpoints and inference code only. Should authors be expected to write the training code and reproduce the results in order to compare their work to previous work trained in a different setting?
@gabriberton This is already the case for some code-bases I've recently worked with. The image means and stds are saved as buffers with the model checkpoint and applied inside the forward pass.
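For anyone curious what that pattern looks like, here is a minimal sketch. The names `NormalizedModel`, `pixel_mean` and `pixel_std` are made up for illustration, the 1x1 conv is a stand-in backbone, and the mean/std values are just the commonly used ImageNet statistics, not taken from any specific code-base.

```python
import torch
from torch import nn


class NormalizedModel(nn.Module):
    """Wraps a backbone and normalizes inputs inside forward().

    Hypothetical example: the class and buffer names are illustrative.
    """

    def __init__(self, backbone, mean, std):
        super().__init__()
        self.backbone = backbone
        # Buffers are stored in state_dict() and follow .to(device),
        # but they are not parameters, so the optimizer never updates them.
        self.register_buffer("pixel_mean", torch.tensor(mean).view(1, -1, 1, 1))
        self.register_buffer("pixel_std", torch.tensor(std).view(1, -1, 1, 1))

    def forward(self, images):
        # images: (N, C, H, W), un-normalized values in [0, 1]
        return self.backbone((images - self.pixel_mean) / self.pixel_std)


model = NormalizedModel(
    nn.Conv2d(3, 4, kernel_size=1),  # stand-in backbone
    mean=[0.485, 0.456, 0.406],      # illustrative ImageNet statistics
    std=[0.229, 0.224, 0.225],
)
state = model.state_dict()            # contains pixel_mean / pixel_std
out = model(torch.rand(2, 3, 8, 8))   # caller passes raw images
```

With this layout the checkpoint carries its own normalization, so inference code does not need to know which statistics were used during training.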
@gabriberton That's a big qualifier IMO. The computation graph will be largely shared if the inputs are applied to an encoder/backbone network, which is often the case. In case it is partially shared, is PyTorch smart enough to release only that part which is not needed by other losses?
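A small experiment to probe this, as a sketch with a toy shared encoder and two made-up losses. As far as I understand, a plain `backward()` frees the saved tensors of every node it traverses, including the shared part, so a second separate `backward()` through the same encoder fails unless the first call used `retain_graph=True`.

```python
import torch
from torch import nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(4, 8), nn.Tanh())  # toy shared encoder
x = torch.randn(2, 4)


def two_losses():
    # Both losses depend on the same encoder output.
    h = encoder(x)
    return h.sum(), (h ** 2).sum()


# Case 1: separate backward calls without retaining the graph.
loss_a, loss_b = two_losses()
loss_a.backward()  # frees saved tensors in the shared encoder
try:
    loss_b.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

# Case 2: keep the graph alive for the second call.
encoder.zero_grad()
loss_a, loss_b = two_losses()
loss_a.backward(retain_graph=True)
loss_b.backward()
grad_separate = encoder[0].weight.grad.clone()

# Case 3: sum the losses and call backward once.
encoder.zero_grad()
loss_a, loss_b = two_losses()
(loss_a + loss_b).backward()
grad_summed = encoder[0].weight.grad.clone()
```

In this toy setup the accumulated gradients from case 2 match the summed-loss gradients from case 3, so the difference between the two is in memory and graph lifetime, not in the result.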
As one journey ends, another begins! For the next phase in life, I've moved to the Bay Area and taken up a Research Scientist position at @BytedanceTalk where I'll continue to work on exciting research problems related to video understanding.
Successfully defended my PhD at the @RWTHVisionLab! I'm thankful to my supervisor, colleagues, family members and to God for this incredible 5-year experience! Aside from the professional/research experience, I'll cherish the personal bonds I made here for a long time to come.