OpenMOSS

Verified account

@Open_MOSS

OpenMOSS is an open research community aimed at building artificial general intelligence. Discord 👇

Joined January 2025

29 Following

288 Followers

39 Posts

Pinned Tweet

about 2 months ago

(1/6) How do you build a video LLM that decouples vision from language instead of jamming it all into one context window?

Our team at OpenMOSS open-sources MOSS-VL, a cross-attention multimodal model with strong video understanding results.

Architecture and benchmarks in thread.

6

16

3

5

2K

Open_MOSS retweeted

@MosiAI_Official

28 days ago

Open-source video should be easy to run, adapt, and build into products.
That’s what MOVA is designed for.

MOVA-360p has reached 142K total downloads on Hugging Face, with 88,362 downloads in the last month.

Developers get open weights, inference code, training pipelines, LoRA fine-tuning scripts, Apache-2.0 licensing, Diffusers support, and Safetensors.
Now, with DiffSynth Studio support for MOVA-360p and MOVA-720p, teams can use MOVA across both inference and training workflows.

Hugging Face: https://t.co/xDnEJ8g0eY
GitHub: https://t.co/CFNm2aUHxN
DiffSynth Studio: https://t.co/HxQ2kTKS5J

0

9

3

1

369

about 2 months ago

Come try MOSS-TTS-Nano! https://t.co/8i2wzlTarh https://t.co/fsrGdldEgZ

@ModelScope2022

about 2 months ago

Say hello to MOSS-TTS-Nano 🚀 0.1B multilingual TTS from https://t.co/YrVk7WDgvQ and OpenMOSS.

Designed for realtime speech generation without a GPU. Runs directly on CPU, keeping the deployment stack simple enough for local demos, web serving, and lightweight product integration.

Part of the MOSS-TTS family alongside the 1.7B and 8B flagship models.

🤖 https://t.co/LewRE4AxEq
🌍 https://t.co/75I7Qmazn0
💻 https://t.co/QF9qwihFT7

8

415

62

429

121K

0

7

4

3

593

about 2 months ago

(6/6) MOSS-VL is live. Two checkpoints (Base + Instruct), Apache 2.0.

🎮 Try it: https://t.co/8uDYMECBWp
🤗 https://t.co/MJDzdaiAF4
🐙 https://t.co/smKnsJnlSY
🇨🇳 https://t.co/VIdIFxCyKZ
📄 arXiv: soon

From OpenMOSS. Bookmark for later. https://t.co/ba9wgM6uoj

0

1

0

0

226

about 2 months ago

(5/6) We propose XRoPE (Cross-attention RoPE), mapping text tokens and visual patches into a unified 3D space: time (t), height (h), width (w).

1. Injected into vision Key + text Query for cross-modal alignment
2. Value left untouched to preserve feature fidelity https://t.co/YtiiDbKJf0
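
To make the idea in (5/6) concrete, here is a minimal sketch of a rotary embedding over three axes, applied only to the query and key. It is not the MOSS-VL implementation. The equal three-way split of the head dimension, the frequency base of 10000, and the function names are all assumptions for illustration, and the post does not say how text tokens are assigned (t, h, w) coordinates, so the sketch takes them as explicit arguments.

```python
import math

def rotate_pairs(vec, pos, base=10000.0):
    """Apply standard 1D RoPE to `vec` (even length) at scalar position `pos`."""
    n = len(vec)
    out = []
    for i in range(0, n, 2):
        theta = pos * base ** (-i / n)  # one rotation frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, t, h, w):
    """Split `vec` into three equal chunks and rotate each by one axis.

    The equal split is an assumption, not something the post specifies.
    """
    n = len(vec) // 3
    assert n * 3 == len(vec) and n % 2 == 0
    return (rotate_pairs(vec[:n], t)
            + rotate_pairs(vec[n:2 * n], h)
            + rotate_pairs(vec[2 * n:], w))

def cross_attention_score(text_q, q_pos, vis_k, k_pos):
    """Dot product of a rotated text query and a rotated visual key.

    Value vectors would be passed through unrotated, as the post describes.
    """
    q = rope_3d(text_q, *q_pos)
    k = rope_3d(vis_k, *k_pos)
    return sum(a * b for a, b in zip(q, k))
```

Because each pair is rotated by an angle proportional to its position, the score depends only on the offset between the query and key coordinates on each axis, which is the usual RoPE property this kind of scheme relies on.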

0

0

0

0

184

about 2 months ago

(4/6) The biggest mistake video LLMs make: they treat frames as a sequence of images, not a sequence in time.

MOSS-VL wraps every frame with special tokens (<|time_start|>1.2 seconds<|time_end|>), anchoring it in absolute time.

Grounded in absolute time, not frame indices. https://t.co/3OPHTZUB1e
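
As a rough illustration of (4/6), the sketch below builds a prompt in which each sampled frame is preceded by a timestamp tag in the format quoted in the post. Only the `<|time_start|>` and `<|time_end|>` tokens and the "1.2 seconds" style come from the post. The `<frame>` placeholder, the one-decimal rounding, and the placement of the tag before the frame are assumptions.

```python
def frame_time_tag(seconds: float) -> str:
    """Format one timestamp in the style quoted in the post."""
    return f"<|time_start|>{seconds:.1f} seconds<|time_end|>"

def build_video_prompt(frame_indices, fps, frame_placeholder="<frame>"):
    """Prefix each sampled frame with its absolute time (frame index / fps).

    `frame_placeholder` is a hypothetical stand-in for the visual input slot.
    """
    parts = []
    for idx in frame_indices:
        parts.append(frame_time_tag(idx / fps) + frame_placeholder)
    return "".join(parts)
```

With this scheme, frame index 36 is tagged as 1.2 seconds in a 30 fps clip but as 3.0 seconds in a 12 fps clip, so the model sees when something happens rather than which sample it was.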

0

0

0

0

165

about 2 months ago

(3/6) We benchmarked MOSS-VL across 30+ multimodal tasks vs Qwen2.5-VL and Qwen3-VL:

1. 📹 Video Understanding: 65.8 (+2 vs Qwen3-VL)
2. 📄 OCR: 83.9
3. 🎯 VSI-bench: +8.3 over Qwen3-VL-8B-Instruct

Consistently first or second across the board. https://t.co/H4Ko78MIvS

0

1

0

1

177

about 2 months ago

(2/6) Hot take: most video LLMs are wired backwards.

They jam visual tokens straight into the LLM context, forcing one model to do both perception and reasoning at once.

But here's the fix: MOSS-VL uses cross-attention to keep the two in separate spaces, talking only when needed.
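
The contrast in (2/6) can be sketched with a generic single-head cross-attention: text tokens supply the queries, visual tokens supply the keys and values, and the visual tokens never enter the text sequence itself. This is textbook scaled dot-product attention in plain Python, not MOSS-VL's layer. Learned projections, multiple heads, and any gating are omitted, and all names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, visual_keys, visual_values):
    """Each text query reads a weighted mix of the visual values.

    The output has one row per text token, so the text sequence length
    is unchanged no matter how many visual tokens there are.
    """
    d = len(visual_keys[0])
    dim_v = len(visual_values[0])
    outputs = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in visual_keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visual_values))
                        for j in range(dim_v)])
    return outputs
```

The design point is that adding more frames grows only the key and value lists, while the language model's own context holds just the text tokens.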

0

0

0

0

147

about 2 months ago

@ModelScope2022 Thanks for sharing!

1

1

0

0

83

2 months ago

📄 Paper: https://t.co/aRhBT6nnKr
🌐 Project: https://t.co/RjZzcmkoDk
💻 GitHub: https://t.co/hY2f0etJzV
🤗 HF Models & Data: https://t.co/KZtIYdXORO

0

1

0

1

124

2 months ago

🚨AI can learn scientific taste. 🔬🤖

Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact.

However, most related research focuses on improving an AI scientist's execution capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem.

For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas.

For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact.

Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test sets, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.

We are no longer just building AI that automates the execution of science. We are building AI that can automate the direction of science. Scientific taste is no longer a human monopoly. We have open-sourced everything. Come build the future of AI scientists with us!

#AutoResearch #AI #Agent #VibeResearch
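
The post does not state the training objective for Scientific Judge, so the following is only a sketch of one common way to learn from high- vs. low-citation pairs: a Bradley-Terry style pairwise loss, swapped in here for illustration. The scalar scores and function names are assumptions, and the actual judge may work differently (for example, by generating a verdict rather than a score).

```python
import math

def pairwise_preference_loss(score_high, score_low):
    """-log(sigmoid(score_high - score_low)), computed stably.

    Small when the higher-cited paper gets the higher score.
    """
    margin = score_high - score_low
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def pair_accuracy(scored_pairs):
    """Fraction of (score_high, score_low) pairs ranked correctly."""
    correct = sum(1 for hi, lo in scored_pairs if hi > lo)
    return correct / len(scored_pairs)
```

A model trained this way yields a score that can later serve as a reward signal for a policy model, which is the role the post assigns to Scientific Judge when training Scientific Thinker.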

1

7

2

2

215

2 months ago

@m4zas24 Hi, you can try it here: https://t.co/50N3RVZHx7 Event tags are on the way.

0

0

0

0

25

4 months ago

🚀 The MOSS-TTS Family is here. From zero-shot cloning to real-time VoiceAgents, we have released our most powerful suite of audio models yet.

The Lineup:
MOSS-TTS Flagship: The industry's best zero-shot voice cloning. Features precise control over duration & Pinyin, capable of generating 1 hour of speech.
MOSS-TTSD-v1.0: A new standard for dialogue generation. Comprehensive optimization for conversational scenes and less widely spoken languages. Best-in-class performance in all evaluations.
MOSS-VoiceGenerator: One-shot timbre generation. Create voices with a single sentence and complex instruction handling.
MOSS-TTS-Realtime: Built for the next era of VoiceAgents. Synthesis starts after just 2 characters for instant response.
MOSS-SoundEffect: Text-to-Audio sound effects to expand your creative toolkit.

🔥 Try it now: https://t.co/sUS7vDjdJk
💻 Deploy (GitHub): https://t.co/h4pAco5nwk
🔌 API Docs: https://t.co/bcWWY31LAO

Come try our demo. The era of 'childhood' for TTS is over.

#MOSS #AI #TextToSpeech #TTS #OpenClaw #Agent #OpenMOSS #Opensource #VoiceAgent

7

30

6

9

4K

2 months ago

MOVA is here: https://t.co/N5i5vz0fFt

2 months ago

MOVA: video and audio sync, based on Wan 2.2 I2V A14B

2

184

23

78

27K

1

1

1

0

314

3 months ago

Our bench can also test image edit models! It's a truly unified multimodal generative reasoning benchmark testing video models, image edit models and VLMs.

Results on mini test set:
(6/6) https://t.co/QhbOveClPO

0

0

0

0

242

3 months ago

CVPR2026 🎉 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm 🌟We use video frames as a unified medium for text and vision reasoning. 🤯 🔥Video model (Sora-2) beats GPT-5 by 10% on Eyeballing Puzzles! 🧵https://t.co/3ChEUJsW6K (1/6) #CVPR2026 #seedance2 #Multimodal #VideoGeneration #Sora2 #Reasoning #LLM #AI

5

17

11

0

2K

3 months ago

What about text-heavy logic? Sora-2 takes a prompt + image, and generates a video "writing" the step-by-step solution. It even reads the answer via audio! 🔊 Staggering results: 🎯 MATH: 92% 🎯 MMMU: 69.2% (5/6)

0

0

0

0

214

3 months ago

Sora-2 solves complex visual puzzles (color filling, shape drawing) by understanding symmetry, gradients, and composition.

On Visual-Shape tasks, Sora-2's inductive reasoning actually matches Claude 3.5 Sonnet! 🎨🧩
(4/6) https://t.co/AZMfUotCRr

0

0

0

0

183

3 months ago

We introduce VideoThinkBench to test this.

On "Eyeballing Puzzles", Sora-2 reasons by simulating light reflection and manipulating geometry.

Result? It outperforms SOTA VLMs and scores 10% higher than GPT-5! 📈🧩

All code and data are open-sourced: https://t.co/nJYEcVqxAP
(3/6)

0

0

0

0

189

3 months ago

Current LLM/VLM paradigms ("Thinking with Text/Images") have limits: static images lack dynamics, and split modalities hinder understanding.

Our fix: Thinking with Video. Video frames as a unified medium to draw/write reasoning steps! ✍️🎥

Project: https://t.co/LGZgDIpxVW
(2/6)

0

0

0

0

197
