🎉 Thrilled to announce that our paper "Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models" received the CVPR Compute Gold Star at #CVPR2026!
Congrats to our co-authors!!🙌
https://t.co/mCGCXz2xrg
Excited to share Qwen-VLA paper, our exploration of generalist Vision-Language-Action models.
It extends Qwen’s multimodal backbone from visual understanding and reasoning to continuous action generation and trajectory prediction.
Paper:
https://t.co/9jvRW0nI8B
🔥LLaVA-OneVision-2.0 Open Sourced🔥
LLaVA-OneVision series @lmmslab now upgrades to 2.0 with its key advance on *codec-stream tokenization*, which treats highly dynamic video as a continuous bit-cost stream
- Tech Report: https://t.co/pFo2fGYj2M
- Code: https://t.co/JvRzu96rJ1
We propose HATCH🐣, a human-inspired training framework for multi-image spatial reasoning in VLMs 🐤
HATCH improves multi-image spatial reasoning ability while preserving single-image reasoning capabilities 🐓
📚️https://t.co/02Ry5iGmn3
🎉 Excited to share that our paper has been accepted to CVPR 2026 and is now available on arXiv!
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
🔗 https://t.co/eCxMdCKHLn
#CVPR2026#arXiv [1/N]
Humans can see in high-res, high-FPS in real-time. Why can't VLMs?
Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos.
📄 https://t.co/GhbWZwMAg7
🌐 https://t.co/mEJ991MAIR
🤗 https://t.co/FOfc2QRThi
(1/n)🧵
Grounding lets vision-language models do more than describe—they can point to where a robot should grasp, which button to click, or which object to track across video frames.
Today we're releasing MolmoPoint, a better way for models to point. 🧵