DeepGen 1.0
A lightweight 5B unified multimodal model that outperforms 80B+ giants like HunyuanImage by 28% on WISE and Qwen-Image-Edit by 37% on UniREditBench—proving scale isn't everything
🚀 Pref-GRPO: A pairwise preference-based GRPO method that tackles reward hacking for T2I models
🎨 UniGenBench: A unified benchmark providing comprehensive, fine-grained evaluation for T2I models across 27 dimensions & 20 scenarios
🤗 Leaderboard: https://t.co/bTUshFNQeI
SEAgent autonomously learns through experiential feedback, evolving from specialists to generalists. Key components include a World State Model and Curriculum Generator.
Read the paper: https://t.co/jSzI9IHt8P
Try the model: https://t.co/Ftr1KKEK8i
Nvidia's got something new
UnifiedReward-Think is here: a multimodal CoT reward model for both visual understanding and generation
https://t.co/k3z5LARosv
🎉 Excited to introduce IDArb! 🎉
Our method can predict plausible and 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁 geometry and PBR material for 𝗮𝗻𝘆 𝗻𝘂𝗺𝗯𝗲𝗿📷 of input images under 𝘃𝗮𝗿𝘆𝗶𝗻𝗴 𝗶𝗹𝗹𝘂𝗺𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀☀️ !
Webpage: https://t.co/GvfyvbEq25
🚀 We’re excited to announce the release of InternLM-XComposer2.5-OmniLive (IXC2.5-OL), a comprehensive multimodal system designed for long-term streaming video and audio interactions. This fully open-sourced project delivers functionality similar to Gemini 2.0 Live Streaming and OpenAI Her, with standout features including:
🎥 Chat with Streaming Video & Audio
💾 Long-Term Memory for recalling past video experiences
🏆 Competitive Performance across various video and audio perception benchmarks
📄 Paper: https://t.co/oiBfU74lR4
💻 Code: https://t.co/hP881kVnlM
📦 Models: https://t.co/t7hGEuj9Qx
✨ Immerse yourself in multimodal interaction and create your own app today!
#Gemini2 #OpenAI #ChatGPTAdvancedVoice
InternLM-XComposer-2.5-OmniLive 🔥: a specialized generalist multimodal system for streaming video and audio interactions by @intern_lm.
Model: https://t.co/dSqEg2mnK8
✨ Apache 2.0, but a form is required for a commercial license
😻Fine-Grained Visual Attributes for GenAI😻
#NeurIPS2024 🍎FiVA🍊 is a fine-grained visual attributes dataset and a framework that decouples different visual attributes for GenAI
- Project: https://t.co/hhSlc7PFQm
- Code: https://t.co/Ggji0AluDN
- Data: https://t.co/LgRjvcShl1
We have released SAM2Long, a training-free enhancement to SAM 2 for long-term video segmentation
🔥 Less error accumulation under occlusion and object reappearance.
⚡️ A training-free memory tree for dynamic segmentation paths, boosting resilience efficiently.
🤯 Significant improvements over SAM 2 across 24 head-to-head comparisons on SA-V and LVOS.
Technical Report: https://t.co/jI0WbJDSHr
Github: https://t.co/nxc1WoMVoO
Homepage: https://t.co/zhx7tQuG2R
#AIML #VideoSegmentation #SAM2Long #ComputerVision
🚀 Check out VideoVista, our comprehensive evaluation benchmark for Video-LMMs! We've assessed 33 Video-LMMs across 27 tasks. Highlights include the latest GPT-4o-Mini, ranked third, and InternLM-XComposer-2.5, the top-performing open-source model.
More: https://t.co/Ey0MIzXIlT
Large Vision-Language Models (LVLMs) perform strongly on single-page document understanding benchmarks such as DocVQA and ChartQA.
An open question remains 🧐: can LVLMs handle long documents well?
We introduce MMLongBench-Doc!
🌐 Project Page: https://t.co/rdGoATeCW3
🧵(1/7)