Proud to share our lab’s @MMLabNTU work Log-linear Sparse Attention (LLSA) - a trainable sparse attention mechanism that reduces attention complexity from O(N²) to O(N log N), making diffusion transformers much more efficient.
Also, special shout-out to the first author @zhouyifan1107 for presenting the poster in full costume - truly above and beyond. The level of dedication is impressive! 👏
#CVPR2026 #DiffusionModels #EfficientAI #SparseAttention
Sa2VA is the first unified model for dense grounded understanding of both images and videos. It combines SAM-2 with MLLM models to enable a wide range of image and video tasks with minimal one-shot instruction tuning.
Introducing our Sa2VA. 🔥
Project Page: https://t.co/6RFu459ks2
GitHub: https://t.co/4VPQxABO33
Huggingface Demo: https://t.co/yyWI7WvQXh (Feat: @huggingface @fffiloni)
We provide various ways to get a quick start. Have fun~🥳
ByteDance just dropped Sa2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗
The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) for images and videos ⏯️
take a look 🧶
Sa2VA 🔥 a unified model for dense grounded understanding of images & videos, released by ByteDance.
Model: https://t.co/yjTV7yGjiI
Paper: https://t.co/ntvcGrtJmg
✨ 1B/4B/8B
✨ Based on InternVL; uses Qwen2 & 2.5 and InternLM as the language part.
✨ unifies text, images, and videos into a shared token space for seamless multimodal interactions
mmlab-ntu presents Open-Vocabulary SAM
Segment and Recognize Twenty-thousand Classes Interactively
paper page: https://t.co/QqODebOcPp
Open-Vocabulary SAM extends SAM's segmentation capabilities with CLIP-like real-world recognition, while significantly reducing computational costs. It outperforms combined SAM and CLIP methods in object recognition on the COCO open vocabulary benchmark
Happy to share that our survey has been accepted by T-PAMI. @HenghuiDing @HarborYuan
We present the first comprehensive survey on open-vocabulary learning: detection, segmentation, video, 3D analysis, etc.
Paper: https://t.co/uju0cI37oy
Github: https://t.co/9BivSt22uN
Glad to share a research project from my postdoc with @HarborYuan @ccloy.
Name: OMG-Seg: Is One Model Good Enough For All Segmentation?
Arxiv: https://t.co/uO1UFFcRjl
Project Page: https://t.co/3tflOkGmOR
Code: https://t.co/Xji2qTcLti
Demo: https://t.co/QBkW7YWHCJ
Thank you @AK for sharing our work!
We are excited to introduce our Open-Vocabulary SAM, which fuses the knowledge from two foundation models (CLIP and SAM) into a unified architecture.
Website: https://t.co/08w7K1s4qU
Code: https://t.co/I4ysfJi2V5
Paper: https://t.co/sgSNvCvBup
Maintaining anonymity on the internet has become increasingly challenging. @MIT @techreview explores China's shift away from online anonymity, highlighting research conducted by #iSchoolUI PhD student @kyriezz78. Check out the full article: https://t.co/xW13oghxnK
Introducing EdgeSAM, the first SAM variant that can run at over 30 FPS on an iPhone 14 with minimal compromise in performance.
Code, models, and Hugging Face demo are available!
arXiv: https://t.co/kPt7fnlCvC
Project page: https://t.co/6w60RZQgnG
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: @Google’s largest and most capable AI model.
Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵 https://t.co/mwHZTDTBuG