Ming Tu @tuming628 - Twitter Profile

15 days ago

@JulianSlzr @kmisiunas Glad to see it actually works even without audio reconstruction loss. Will play with it locally in my Hermes agent.

0

1

0

14

Ming Tu @tuming628

about 1 month ago

@_alex_kirillov_ Congrats on the release

0

155

Ming Tu @tuming628

about 1 month ago

@levinstanley @OpenAI Cool, was the video edited? It looks so fast

1

0

189

Ming Tu @tuming628

about 1 month ago

Hope it's a full-duplex model

Sam Altman

@sama

about 1 month ago

people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)

869

7K

289

635

507K

0

1

0

60

Ming Tu @tuming628

about 1 month ago

it's actually not

Ming Tu @tuming628

about 1 month ago

@rdesh26 could be the full-duplex voice mode

0

1

0

147

0

1

0

42

Ming Tu @tuming628

2 months ago

The world doesn't wait its turn. Neither should conversational AI. Let's step into the Full-Duplex era.

0

52

Ming Tu @tuming628

2 months ago

For the past year, I have been working on the development of Seeduplex. Today, we officially launched the industry's first native full-duplex speech LLM, completely replacing the half-duplex and turn-by-turn system released early last year. https://t.co/rukxhsesFp

2

4

0

1

165

Ming Tu @tuming628

2 months ago

This architecture is now fully deployed in production on the Doubao App, processing continuous, real-time voice interactions for hundreds of millions of users. Read the technical blog for more details.

0

1

0

50

Ming Tu @tuming628

3 months ago

@JulianSlzr @rdesh26 True. It doesn't need to be an end-to-end model to achieve a full-duplex experience, especially if we think of text-to-speech as a tool that can be called when the model/system believes it's time to say something.

0

1

0

31

Ming Tu @tuming628

4 months ago

Our recent work uses Reinforcement Learning (GRPO) and MLLM-based rewards to fine-tune audio-driven video models, significantly improving lip-sync and natural expressiveness.

arXiv Sound @ArxivSound

4 months ago

Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu, "FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation," https://t.co/ZpRIVcx7BN

0

1

295

0

2

0

1

139

Ming Tu @tuming628

4 months ago

@CarlZha the doormat with "taoyuanli" kills it. Should be in Chinese calligraphy

0

185

Ming Tu @tuming628

4 months ago

@rdesh26 One example is using an LLM to generate both slides and a presentation. In this case, the slide content and the spoken content may differ, so the LLM needs to decide when to take the action of calling TTS.
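
A minimal sketch of the slides-plus-narration example, assuming a harness in which TTS is just one tool among others. The names `Action`, `Harness`, `show_slide`, `speak`, and `scripted_policy` are hypothetical, and the scripted policy is only a stand-in for an LLM deciding when to call TTS; this is an illustration of the idea, not any particular system's design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    tool: str     # hypothetical tool name: "show_slide" or "speak"
    payload: str  # slide text or the sentence to synthesize


@dataclass
class Harness:
    """Dispatches each action to the matching tool and records the order."""
    tools: Dict[str, Callable[[str], None]]
    log: List[str] = field(default_factory=list)

    def run(self, actions: List[Action]) -> None:
        for action in actions:
            self.tools[action.tool](action.payload)
            self.log.append(action.tool)


def scripted_policy() -> List[Action]:
    # Stand-in for the LLM: it emits slide text and spoken text separately
    # and chooses the moments at which the TTS tool ("speak") is called.
    return [
        Action("show_slide", "Half-duplex vs. full-duplex"),
        Action("speak", "Let me start with how turn-taking works today."),
        Action("show_slide", "TTS as a tool"),
        Action("speak", "Here the model decides when it is time to talk."),
    ]


slides: List[str] = []
spoken: List[str] = []  # a real system would send these strings to a TTS engine
harness = Harness(tools={"show_slide": slides.append, "speak": spoken.append})
harness.run(scripted_policy())
```

The point of the design is that slide content and spoken content are separate outputs, and speaking only happens when the policy explicitly emits a `speak` action.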

0

30

Ming Tu @tuming628

4 months ago

For text-only systems, a system includes AI models, tools, and a harness (tool usage, context & memory management, etc.). In this sense, both ASR and TTS can be considered tools.

Desh Raj

@rdesh26

4 months ago

Voice AI has grown a lot recently, and definitions of models/systems have become somewhat vague. Let's put down some basics.

1. AI "models" are not AI "systems". Models are the core units that build up a system. For text-only systems, the two are trivially equivalent (discounting the BPE tokenizer/detokenizer), but not for voice. For voice AI systems, examples of models may be ASR, TTS, LLM, SpeechLLM, OmniLLM, etc.

2. A model is the smallest replaceable unit within a system. For example, an STT model (user speech in / agent text out) often contains a speech encoder + an LLM, but neither of these components can be replaced without having to train the model again.

3. A speech-to-speech "system" (often called a voice agent) may take many forms and comprise many components, but it is always based on two requirements: (A) response generation --> what/how to respond; (B) duplex control --> when to talk. Traditionally, (A) has been handled through an ASR/LLM/TTS cascade. Most of the current S2S modeling research aims to replace this pipeline with fewer models (either STT+TTS or S2S). Most systems still rely on external VADs and WebRTC for (B), with the famous exception of "full-duplex" models like Moshi.

4a. A SpeechLLM is a model that takes text+speech input, but only generates text output. It is also called a "speech understanding" model.

4b. An OmniLLM is a SpeechLLM that also generates speech (either codecs or continuous latents). It is also called a "speech generation" model (not to be confused with a TTS).

5. A speech-to-speech system is considered "realtime" if it satisfies 3 conditions: low latency (< 1s), streaming audio in/out, and barge-in/interruption handling. It can also be called a full-duplex system (not to be confused with a full-duplex "model").
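
The three "realtime" conditions in the post above can be written as a simple check. This is only a sketch: the `VoiceSystem` fields, the `is_realtime` helper, and the two example systems with their latency figures are hypothetical and invented for illustration; only the three conditions and the 1 s threshold come from the post.

```python
from dataclasses import dataclass

MAX_LATENCY_S = 1.0  # "low latency (< 1s)" from the post


@dataclass(frozen=True)
class VoiceSystem:
    name: str
    response_latency_s: float   # time from end of user speech to first audio out
    streams_audio_in_out: bool  # audio is streamed in both directions
    handles_barge_in: bool      # the user can interrupt the agent mid-response


def is_realtime(system: VoiceSystem) -> bool:
    """True only if all three conditions from the post hold."""
    return (
        system.response_latency_s < MAX_LATENCY_S
        and system.streams_audio_in_out
        and system.handles_barge_in
    )


# Hypothetical systems with made-up numbers, for illustration only.
cascade = VoiceSystem("turn-based cascade", 1.8, False, False)
duplex = VoiceSystem("streaming system with barge-in", 0.4, True, True)

print(is_realtime(cascade), is_realtime(duplex))
```

Note that the check describes a system, not a model: per the post, a cascade with external VAD could pass it without containing a full-duplex model.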

1

29

0

9

2K

1

0

336