Just Released: A Survey on Token Compression for MLLMs
How do we efficiently "reduce" the number of tokens in MLLMs?
Our work introduces: 1) A taxonomy based on Where to Compress (Encoder/Projector/LLM). 2) A deployment roadmap on How to Select the right algorithm.
[1/n]
Super excited to introduce PaperBanana! (PKU x Google Cloud AI)
As AI researchers, we often spend way too much time crafting diagrams and plots instead of focusing on the ideas. To rescue us from this burden, we built an Agentic Framework to auto-generate NeurIPS-quality paper illustrations!
Paper: https://t.co/2NbQeEhzMv
Page: https://t.co/05dKkjVs7f
Key Features:
Human-like Workflow: Retrieve -> Plan -> Style -> Render -> Critique. This ensures both academic fidelity and aesthetics.
Versatile: Supports both illustrative diagrams and statistical plots.
Polishing: Also effective for polishing existing human-drawn diagrams.
Here are some example diagrams and plots generated by our PaperBanana:
Video understanding isn't just recognition; it demands reasoning across thousands of frames.
Meet Long-RL! Highlights:
Dataset: LongVideo-Reason, 52K QAs with reasoning.
System: MR-SP, 2.1× faster RL for long videos.
Scalability: RL on hour-long videos (3,600 frames) on a single node (8×A100s).
RL training for video, text, and audio; works with VILA, Qwen series, and image/video generation models
Paper: https://t.co/vbU5n0w0go
Demo: https://t.co/3wCv5TJsTa
Code: https://t.co/K9U4fl3HHc
Happy to share that our TimeChat-Online work has been accepted to ACM Multimedia 2025!
Check out the project page: https://t.co/ts7IeGz6rc
Star our repo if you like it: https://t.co/ue0jQDh42p
#VideoLLM #StreamingAI #ACMMM2025
Excited to share our new survey on the reasoning paradigm shift from "Think with Text" to "Think with Image"!
Our work offers a roadmap for more powerful & aligned AI.
Paper: https://t.co/ZfaT9CCYuW
GitHub (400+ stars): https://t.co/YLRaGvB70q
Check out our new demo video that shows how TimeChat-Online makes real-time video understanding efficient, fun, and intuitive!
Demo: https://t.co/6lajeVFbG5
Project: https://t.co/ts7IeGz6rc
Try it out and let us know what you think!
#StreamingVideo #MultimodalAI
We introduce Reinforcement Pre-Training (RPT),
reframing next-token prediction as a reasoning task using RLVR
General-purpose reasoning
Scalable RL on web corpus
Stronger pre-training + RLVR results
Allows allocating more compute to specific tokens
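To make the "next-token prediction as a reasoning task" idea concrete, here is a minimal sketch of what a verifiable reward could look like. It assumes the simplest possible rule, an exact match between the model's predicted next token and the token that actually follows in the corpus; the post does not spell out the reward, so the real RPT reward may differ. The names `next_token_reward` and `score_rollouts` are hypothetical.

```python
def next_token_reward(predicted: str, ground_truth: str) -> float:
    """Assumed rule: reward 1.0 only when the prediction equals the corpus token."""
    return 1.0 if predicted == ground_truth else 0.0


def score_rollouts(predictions: list[str], ground_truth: str) -> list[float]:
    """Score several sampled predictions for the same context position."""
    return [next_token_reward(p, ground_truth) for p in predictions]


# Three hypothetical rollouts for a context whose true next token is " Paris".
rewards = score_rollouts([" Paris", " London", " Paris"], " Paris")
print(rewards)  # [1.0, 0.0, 1.0]
```

Because the reward is checked against the corpus itself, any plain text can serve as training signal without human labels, which is one reading of the "scalable RL on web corpus" point above.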
MiMo-VL technical report, models, and evaluation suite are out!
Models: https://t.co/Qb2zYTVfzS (or RL)
Report: https://t.co/AqTpy0r2bI
Evaluation Suite: https://t.co/s0rU38DoyU
Looking back, it's incredible that we delivered such compact yet powerful vision-language models in under six months.
Here are my key takeaways from our journey:
Reasoning is now essential for VLMs. Adding long chain-of-thought data to our training produced clear performance gains across all benchmarks. What's fascinating is watching our model actually examine different parts of images, checking various details before working through its reasoning to reach an answer.
Mixed reward learning was our biggest challenge and most inspiring discovery. We saw comprehensive improvements on almost every task with objective rewards like document perception, visual grounding, and multimodal math. MiMo-VL-RL is now the best open-source VLM on the InfoVQA test set. But subjective rewards like human preference data proved much trickier: models learn to game these signals surprisingly quickly. Finding the right balance is truly an art.
We're committed to reproducible VLM research. Throughout development, we experienced firsthand how difficult it is to reproduce results from other papers. Different prompts, temperature settings, and evaluation processes make fair comparisons nearly impossible. That's why we're releasing our complete evaluation suite covering 50+ tasks, built on lmms-eval, with fully reproducible results. We might be the first to do this comprehensively, and we hope it helps advance the field by making research more transparent and comparable.
New Paper: Pixel Reasoner
How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself?
We introduce Pixel Reasoner, the first open-source framework that enables VLMs to "think in pixel space" through curiosity-driven reinforcement learning.
Current VLMs reason only in text: even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn.
So we ask:
What if we could make VLMs "show their work" by reasoning directly in the pixel space?
Inspired by GPT-o3's "think-in-image" ability, we propose a framework where VLMs use interactive visual operations (zoom, select-frame, highlight) to reason through complex visual inputs.
To do this, we design a two-stage training process: (1) instruction tuning with synthesized visual reasoning traces, and (2) reinforcement learning with a curiosity-driven reward to balance exploration between pixel and text reasoning.
With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks:
84% on InfographicsVQA
84% on V* benchmark
74% on TallyQA-Complex
It also achieves strong accuracy of 68% on MVBench (a video benchmark).
Website: https://t.co/3YUxaIJmIv
Paper: https://t.co/CHYukmu5fB
Code: https://t.co/0mQOfXbKpM
Demo: https://t.co/AWDNoffEz8 (coming soon)
Delighted to share that our paper GenS has been accepted to ACL 2025 Findings
It's been a real pleasure working with my wonderful collaborators! #ACL2025 #Multimodal #VideoLLM
Code: https://t.co/lPmrF9Gdl6
Dataset: https://t.co/3TddlG3ccs
Introducing GenS: Generative Frame Sampler for Long Video Understanding!
It can identify query-relevant frames in long videos (minutes to hours) for accurate VideoQA
Project page: https://t.co/rXMvB06fAz
(4/n)
Performance Highlights:
• StreamingBench: 56.6 accuracy with 82.6% token reduction (new SOTA)
• OVO-Bench: 45.6 accuracy with 84.8% tokens dropped (new SOTA)
• Long video benchmarks (MLVU, VideoMME, LongVideoBench): up to 85% drop with no performance loss
(3/n)
DTD can be directly plugged into the Qwen2.5-VL series without training.
On VideoMME (30–60 mins), Qwen2.5-VL-7B w/ DTD boosts accuracy by 5.7 points while dropping 84.6% of tokens
Longer videos tolerate even higher drop rates: up to 97.5% while maintaining accuracy!
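For intuition on how a training-free drop step can remove most video tokens, here is a toy sketch. It is not the DTD algorithm, whose internals are not described in this thread; it only illustrates the general idea of discarding tokens that barely change between consecutive frames. Scalar values stand in for token features, and `drop_redundant_tokens`, `drop_ratio`, and the threshold are hypothetical.

```python
def drop_redundant_tokens(frames, threshold):
    """Keep a token only if it differs from the same position in the
    previous frame by more than `threshold`; the first frame is kept whole."""
    kept = []
    prev = None
    for t, frame in enumerate(frames):
        for pos, value in enumerate(frame):
            if prev is None or abs(value - prev[pos]) > threshold:
                kept.append((t, pos, value))
        prev = frame
    return kept


def drop_ratio(frames, kept):
    """Fraction of all tokens that were dropped."""
    total = sum(len(frame) for frame in frames)
    return 1.0 - len(kept) / total


# Three toy frames with three "tokens" each; only one token changes after frame 0.
frames = [[0.1, 0.5, 0.9], [0.1, 0.5, 0.2], [0.1, 0.5, 0.2]]
kept = drop_redundant_tokens(frames, threshold=0.05)
print(len(kept), round(drop_ratio(frames, kept), 3))  # 4 0.556
```

In this toy setup, the more static the video, the larger the share of tokens that fall below the threshold, which is at least consistent with the observation that longer videos tolerate higher drop rates.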