Qwen3-TTS Family is now Open Sourced!
- Supports 10 languages
- Has voice cloning and design abilities
- 5 models across two sizes (0.6B and 1.7B)
- High compression (12Hz)
Highlights:
- Qwen3‑TTS‑VoiceDesign beats MiniMax‑Voice‑Design and other open‑source models on InstructTTS‑Eval for instruction following and expressive generation
- Qwen3‑TTS‑Instruct achieves an average multilingual WER of 2.34%.
- Qwen3‑TTS‑VoiceClone achieves 1.835% WER and 0.789 speaker similarity on the MiniMax TTS multilingual test set across 10 languages, outperforming ElevenLabs and MiniMax
Congrats to the team!
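For anyone unfamiliar with the two metrics quoted above, here is a minimal sketch of how word error rate (WER) and speaker similarity are commonly computed. It is not the Qwen team's evaluation code. The function names are illustrative, and the whitespace tokenization and plain cosine similarity are assumptions. Real benchmarks typically transcribe the synthesized audio with an ASR model, normalize the text, and take embeddings from a speaker-verification model, none of which is specified in the post.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Toy three-dimensional "embeddings"; real ones have hundreds of dimensions.
    print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```

Lower WER means the generated speech is more intelligible to the ASR model, and a speaker similarity closer to 1 means the cloned voice sits closer to the reference speaker in embedding space.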
NVIDIA's PersonaPlex model is super impressive, enabling natural, human-like conversations while responding with nods and affirmations!
Super natural sounding + can run locally at 7B size.
- Inspired by Moshi from @kyutai_labs
- Full duplex, which means the AI listens while talking (no more awkward AI pauses)
- Handles interruptions, backchannels, turn-taking
Congrats to the team!!
Just finished reading a paper showing that modular speech-to-speech systems can achieve sub-second latency like end-to-end models while keeping fine-grained controllability.
Summary in thread 🧵
Future work (6/6)
- They plan to add components such as acoustic echo cancellation to improve stability
- They may also add more generative models for more flexible and expressive audio output
- Congrats to the MoE Key Lab of AI and X-LANCE Lab (SJTU), Shanghai AI Laboratory, BIGAI, Fudan University, Northwestern Polytechnical University, and SIAT-CAS for this paper