Today we're releasing our first open-source TTS model, TADA!
TADA (Text Audio Dual Alignment) is a speech-language model that generates text and audio in one synchronized stream to reduce token-level hallucinations and improve latency.
This means:
→ Zero content hallucinations across 1,000+ test samples
→ 5x faster than similar-grade LLM-based TTS
→ Fits much longer audio: 2,048 tokens cover ~700 seconds with TADA vs. ~70 seconds in conventional systems
→ Free transcript alongside audio with no added latency
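A quick back-of-envelope check on the context-length claim above, as a minimal Python sketch. It uses only the approximate figures quoted in this post (2,048 tokens, ~700 seconds, ~70 seconds). The variable and function names are illustrative, and the resulting rates are averages implied by those figures, not TADA's actual frame rate.

```python
# Figures quoted in the announcement (the durations are approximate).
CONTEXT_TOKENS = 2048
TADA_SECONDS = 700
CONVENTIONAL_SECONDS = 70


def tokens_per_second(context_tokens: int, seconds: float) -> float:
    """Average number of context tokens consumed per second of audio."""
    return context_tokens / seconds


tada_rate = tokens_per_second(CONTEXT_TOKENS, TADA_SECONDS)
conventional_rate = tokens_per_second(CONTEXT_TOKENS, CONVENTIONAL_SECONDS)

# How many times more tokens a conventional system spends per second of audio.
ratio = conventional_rate / tada_rate

print(f"TADA: ~{tada_rate:.1f} tokens per second of audio")
print(f"Conventional: ~{conventional_rate:.1f} tokens per second of audio")
print(f"Ratio: ~{ratio:.0f}x")
```

In other words, the quoted numbers imply roughly 3 tokens per second of audio for TADA versus roughly 29 for conventional systems, which is where the ~10x longer audio in the same context window comes from.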
Today, voice models have no problem generating “angry” or “sad” expressions.
But ask for:
→ bored + fast
→ joy + shy
→ disappointment + confident
…and most systems collapse into stereotypes.
Our latest research blog explores why this happens — and how disentangling emotion from voice at the data layer improves expressive control. Read more below!
We’re excited to launch the 2026 ACII Dyadic Contest (DaiKon) Workshop & Challenge—a new benchmark for modeling emotional influence in dyadic dialogue.
Explore a sample of our conversational audio dataset: 945 sessions, 743 hours, across 5 languages.
Submissions due May 25. We look forward to your participation!
Today, we're shipping MLX support for TADA, our open-source text-to-speech model, which means the entire pipeline (LLM, flow matching, and decoder) can now run locally on any Apple Silicon device. We're seeing a 45% reduction in memory usage and a 10x speed-up when running the quantized model. With these improvements, you can use TADA on-device for OpenClaw or any personal chatbot.
If you own a MacBook, Mac Mini, or Mac Studio, record a 10-second clip of any voice, type any text, and get high-quality, natural, expressive speech in real time. Completely offline, completely free.
I made a @Gradio demo for TADA on a @huggingface Space to make the paper’s workflow easier to explore.
The original demo was a bit confusing, so this one is more guided and helps you understand what’s really going on — and in what order the pipeline is supposed to work.