Xipeng Qiu

Researcher of AI. Assistant Professor @Tsinghua_Uni. Working on scalable methods of language and physical models @nature_will_ai.

about 2 months ago

Welcome to try MOSS-TTS-Nano! https://t.co/8i2wzlTarh https://t.co/fsrGdldEgZ

0

7

4

3

613

xpqiu retweeted

ModelScope

@ModelScope2022

about 2 months ago

OpenMOSS drops two model series today: MOSS-VL and MOSS-Video-Preview. 🚀

MOSS-VL: offline multimodal engine with cross-attention architecture, XRoPE, and absolute timestamp injection.
🎬 Video score 65.8, beats Qwen3-VL by +2 pts. VSI-bench +8.3 vs Qwen3-VL-8B-Instruct.
🖼️ Strong on image understanding, OCR, document parsing, and visual reasoning.
Two checkpoints: Base (pretrain) and Instruct (SFT).

MOSS-Video-Preview: built for real-time streaming video understanding. Cross-attention backbone on Llama-3.2-Vision, native frame-by-frame injection, duplex "listen-speak" switching.
👉 Three checkpoints: Base (pretrain) → SFT (offline instruction) → Realtime-SFT (low-latency streaming, sub-ms TTFT).

🤖 MOSS-VL: https://t.co/n6xAqHpTCm
🤖 MOSS-Video-Preview: https://t.co/jbFkgUv9eG

1

28

14

3K

xpqiu retweeted

ModelScope

@ModelScope2022

about 2 months ago

Say hello to MOSS-TTS-Nano 🚀 0.1B multilingual TTS from https://t.co/YrVk7WDgvQ and OpenMOSS.

Designed for realtime speech generation without a GPU. Runs directly on CPU, keeping the deployment stack simple enough for local demos, web serving, and lightweight product integration.

Part of the MOSS-TTS family alongside the 1.7B and 8B flagship models.

🤖 https://t.co/LewRE4AxEq
🌍 https://t.co/75I7Qmazn0
💻 https://t.co/QF9qwihFT7

8

414

62

428

121K



xpqiu retweeted

about 2 months ago

(1/6) How do you build a video LLM that decouples vision from language — instead of jamming it all into one context window? Our team at OpenMOSS open-sources MOSS-VL, a cross-attention multimodal model with strong video understanding results. Architecture and benchmarks in thread.

6

16

3

5

2K

about 2 months ago

https://t.co/dOGvSqF6zH

0

58

about 2 months ago

MOSS-VL is the core multimodal model series within the OpenMOSS ecosystem, dedicated to advancing visual understanding. To tackle the inherent complexities of video comprehension, our roadmap pursues a systematic scaling strategy along three key dimensions: https://t.co/BUtxe6ejT2

2

10

0

1

303

about 2 months ago

📈 Data Scaling: Curating massive-scale, high-quality datasets to drive robust generalization.
🧠 Parameter Scaling: Expanding model capacity to capture intricate vision-language correlations.
⏳ Context Scaling: Extending temporal horizons to enable understanding of long-form video content.

0

1

0

51

xpqiu retweeted

arXiv Sound @ArxivSound

3 months ago

Yitian Gong, Botian Jiang, et al., "MOSS-TTS Technical Report," https://t.co/xE30tk0IyG

0

17

2

6

934

xpqiu retweeted

DailyPapers

@HuggingPapers

4 months ago

MOSS-Audio-Tokenizer

A 1.6B parameter pure Transformer audio tokenizer trained end-to-end on 3M hours of audio. Scales gracefully across speech, sound, and music while enabling the first purely autoregressive TTS to surpass non-autoregressive systems. https://t.co/rVXmpXs23C

2

124

16

91

6K

xpqiu retweeted

Zhengfu He @ZhengfuHe

4 months ago

More details are in our paper: https://t.co/A5fNDaJApg Code and replacement layer weights will be open-sourced later. Still writing the docs and testing! https://t.co/uHDvU6IPOM

1

45

3

26

2K

xpqiu retweeted

arXiv Sound @ArxivSound

4 months ago

Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu, "MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models," https://t.co/IQ2Yaxhf6q

0

28

5

10

2K

xpqiu retweeted

Wildminder

@wildmindai

4 months ago

WOW! New vid model - MOSS-Video-and-Audio:
- native bimodal gen, IT2VA, T2VA;
- 32B MoE for sync video & audio in one pass;
- SOTA multilingual lip-sync + Sound FX;
- 360p/720p, with code, weights & LoRA.
Beyond words. Seriously cool. https://t.co/yU4DO0GtCc

6

250

26

191

16K

xpqiu retweeted

4 months ago

We took iconic screenshots from classic cinema and remade the scenes using #MOVA. 🎬✨ https://t.co/N5i5vz0fFt Our focus was on seamless end-to-end audio & video generation. #AIVideo #OpenSource #ClassicMovies #GenAI

0

9

2

0

656

xpqiu retweeted

techliteracy @ifree_news

4 months ago

MOVA (MOSS Video and Audio), a foundation model designed to synthesize video and audio simultaneously https://t.co/3wIfp5WWYK

0

1

0

77

xpqiu retweeted

Wildminder

@wildmindai

4 months ago

Hot! We have a new strong voice model. MOSS-TTS - a production-ready flagship 8B TTS;
- high-fidelity zero-shot voice cloning, stable long-form gen;
- multilingual;
- lossless reconstruction; fine-grained pronunciation control;
- token-level duration control,
- voice creator, sound effects.
Outstanding quality.
https://t.co/LNCkEVuLnG

1

236

34

264

13K

xpqiu retweeted