Our paper DUET-VLM: Dual-stage Unified Efficient Token Reduction for VLM Training and Inference is accepted to CVPR 2026.
We propose a unified framework for training-aware visual token reduction in VLMs.
📄 https://t.co/4ayO777DAt
🔗 https://t.co/O4lfNKmAHw
#CVPR2026
Our paper was desk rejected @NeurIPSConf! Even before the main deadline! "Non-academic title and abstract" 🙈
Thankfully, @SIGGRAPHAsia was around the corner and a perfect fit for our work on improving robustness of multi-subject multi-attribute layout-guided T2I models!
🧵1/9
This is our group's first work in generation. The @SIGGRAPHAsia team, shepherd, and reviewers helped us a lot! Thank you.
Shivank @x47bsaltydog and Dhruv @sridhruv led this well and taught me a lot about diffusion models in the process! 🎉
Paper: https://t.co/ZlTThPX1H7
🏁 9/9
A large contingent from IIITH’s Computer Vision Lab participated at the Conference on Vision and Pattern Recognition (CVPR) last month in Nashville. Read on about the cutting edge research that was presented and why it’s a big deal in the vision circles. https://t.co/RTFb0YEENe
AMD has announced the open-sourcing of Instella, a fully open 3-billion-parameter LMs trained on AMD Instinct MI300X GPUs.
These models aim to promote collaboration in the AI community by providing access to model weights, configurations, and more. See more from @phoronix ⬇️
Thanks to the organizers (@davmoltisanti +) for an opportunity to share my thoughts at the amazing @CVPR Workshop "What Is Next in Video Understanding" https://t.co/je25r8fwbQ
Slides summarizing some of our work on video and open challenges here: https://t.co/KuyRJ1TpPi
🚨 Introducing Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation.
Given an image pair, it is easier for an MLLM to identify fine-grained visual differences during VQA evaluation than to independently detect and describe such differences 🧵(1/n):
@SamsungIndia I recently faced a frustrating experience with my Galaxy Buds 2 Pro from Samsung India. Despite being told multiple times by customer service that the product is compatible with iOS, I discovered that it lacks full functionality when paired with my iPhone. #Samsung
The product's compatibility issues were not clearly mentioned, leading me to believe this is a defect. If the earbuds can only function as basic Bluetooth devices without access to essential features, shouldn't that be considered a flaw?
Recaps are a wonderful narrative sequence that spark viewers' memories 🧠 to follow the current episode. In our latest @CVPR 2024 work with @rodosingh23 and @sridhruv, we leverage recaps to train models that generate story summaries for TV episodes. 📺
🧵1/8
Given multiple short movie clips, can models generate coherent identity-aware descriptions? 🤔 Turns out, this is a complicated task as it requires linking identities and what they are doing over time. Our @CVPR 2024 work improves this: https://t.co/K4TQFm6GYg
🧵1/7
OpenAI just announced "GPT-4o". It can reason with voice, vision, and text.
The model is 2x faster, 50% cheaper, and has 5x higher rate limit than GPT-4 Turbo.
It will be available for free users and via the API.
The voice model can even pick up on emotion and generate emotive voice.