Gradium is out of stealth to solve voice. We raised $70M and after only 3 months we’re releasing our transcription and synthesis products to power the next generation of voice AI.
AIEWF next week. We'll be at booth U-G8 with @pipecat_ai, and our CEO @neilzegh is giving 3 talks:
→ Your voice agent is just a walkie-talkie
→ Voice is the universal interface, w/ @kwindla
→ Everyone gets a digital clone
Today we launch stt-translate and s2s-translate: real-time speech-to-text and speech-to-speech translation. They compete with gemini-3.5-live-translate and gpt-realtime-translate on latency and quality, while allowing you to speak in any voice from our catalog or one you clone. Try them for free today on https://t.co/HLVvh94Kok
Long flights always give me more ideas to think about what's missing around us.
A few prompts later, here's Scribble Story.
A fully local, on-device pipeline that turns scribbles into a short story you can listen to.
Using @GradiumAI Phonon and @Alibaba_Qwen
60+ new voices live in the Gradium catalogue. English, Spanish, French, German, and Portuguese, with eight regional accents across them. https://t.co/ir8tJzCp6J
We upgraded Gradium TTS for the cases voice agents can't get wrong: phone numbers, codes, and email addresses, read back correctly the first time. A couple of examples: English: 97% on emails, top of the field. French: leads every competitor we benchmarked. Samples + methodology → https://t.co/mXjGwCCa3O
In this joint work with @kyutai_labs, we design a reward model for conversational dynamics to teach full-duplex models how a human behaves in conversation, using cues to know when to interrupt, backchannel or stay silent.
New paper: Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
We use RL to post-train speech models (Moshi and PersonaPlex) to talk more like a human: to know when to respond, when to wait, and when to nod along with “yeah”s and “okay”s when listening.
We'll be at @VivaTech next week showcasing our models. Come find us at Booth 7.2 | 2F13 with @awscloud all week, and on the @LaFrenchTech booth on Wednesday.
@neilzegh is giving two talks: Wed 17th, 5:20pm, @nvidia Stage 1 and on Fri, 10am, Théâtre AWS
Learn how to build an audiobook voice agent using Gradium and @pipecat_ai
Gradium's TTS handles the narration and Pipecat's built-in WebRTC transport delivers the audio to the browser.
Reasoning LLMs typically take 2-3 seconds to start emitting tokens. In a voice agent, that's 2-3 seconds of silence after the user finishes speaking.
The @MiniMax_AI team just shipped a community contribution to Gradbot that runs two models in parallel. MiniMax-M2-her produces a short acknowledgement that starts streaming to TTS immediately, while MiniMax-M2.7 handles reasoning and tool calls in the background.
Thanks to @davidtaoweiji for this contribution. Check out our README for more details.
https://t.co/gxSTdrCiAm
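A minimal sketch of the two-model pattern, not the Gradbot implementation: `fast_ack`, `slow_reasoner`, `respond` and `collect_spoken` are hypothetical names, and the sleeps stand in for real model latency. The slow task starts first so it keeps running while the acknowledgement is spoken.

```python
import asyncio
from typing import Awaitable, Callable


async def fast_ack(user_text: str) -> str:
    # Hypothetical stand-in for a small, low-latency model.
    await asyncio.sleep(0.01)
    return "Sure, let me check that."


async def slow_reasoner(user_text: str) -> str:
    # Hypothetical stand-in for a reasoning model with tool calls.
    await asyncio.sleep(0.1)
    return f"Here is the full answer to: {user_text}"


async def respond(user_text: str, speak: Callable[[str], Awaitable[None]]) -> None:
    # Launch the slow model first so it runs in the background.
    slow_task = asyncio.create_task(slow_reasoner(user_text))
    # Speak the short acknowledgement as soon as it is ready.
    await speak(await fast_ack(user_text))
    # Then speak the full answer once the background task finishes.
    await speak(await slow_task)


def collect_spoken(user_text: str) -> list[str]:
    # Demo helper: "speaking" just appends to a list, in order.
    spoken: list[str] = []

    async def speak(text: str) -> None:
        spoken.append(text)

    asyncio.run(respond(user_text, speak))
    return spoken
```

In a real agent, `speak` would stream text to TTS; here it only records the order, so the acknowledgement always lands before the full answer.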
A full house at the @joinhexa office in Paris yesterday.
Our CTO @olivierteboul joined the discussion by sharing why low latency matters for voice agents and how Gradium models support enterprise use cases for voice AI.
"I'd like to cancel my flight from Boston to..." You pause to check a date. The agent cuts in: "Got it, where to?" Now you're talking over it to finish your own sentence.
That's acoustic turn detection. Semantic VAD waits because it knows you're not done: https://t.co/1NPxkPGfyC
👉 https://t.co/4FKgnYn8vE
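A toy sketch of the difference, not how Gradium's semantic VAD works: the acoustic check only looks at silence length, while the "semantic" check here is a hypothetical word-list heuristic that waits longer when the transcript looks unfinished. All names and thresholds are assumptions for illustration.

```python
# Words that rarely end a finished sentence (illustrative list only).
DANGLING_WORDS = {"to", "from", "and", "or", "but", "the", "a", "on", "at", "for", "with"}


def acoustic_end_of_turn(silence_ms: int, threshold_ms: int = 500) -> bool:
    # Acoustic turn detection: any pause past the threshold ends the turn.
    return silence_ms >= threshold_ms


def looks_unfinished(transcript: str) -> bool:
    # Crude proxy for "the speaker is not done": trailing ellipsis or dangling word.
    text = transcript.strip().lower()
    if not text:
        return True
    if text.endswith(("...", "…")):
        return True
    words = text.rstrip(".,!?").split()
    return bool(words) and words[-1] in DANGLING_WORDS


def semantic_end_of_turn(
    transcript: str,
    silence_ms: int,
    threshold_ms: int = 500,
    patience_ms: int = 3000,
) -> bool:
    # Wait much longer before cutting in when the sentence looks incomplete.
    limit = patience_ms if looks_unfinished(transcript) else threshold_ms
    return silence_ms >= limit
```

With a 700 ms pause after "I'd like to cancel my flight from Boston to...", the acoustic check ends the turn while the heuristic keeps listening.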
At SlatorCon London, we discussed voice #AI capabilities and deployments, and how voice AI 🗣️🤖 is shifting the operational infrastructure ⚙️ of enterprises with Neil Zeghidour, Co-Founder and CEO at @GradiumAI, Arkadiusz Kwapiszewski, Head of Agent Design & Engineering at @polyaivoice, and Peadar Coyle, CTO & Co-Founder at AudioStack.
#VoiceAI #ConversationalAI #LanguageAI @neilzegh @Springcoil
Berlin, what's up? Tavily is in town now! We're here with @GradiumAI showing off our new voice integration and hosting a hackathon alongside @nebiusai and @cursor_ai. You won't want to miss this one.
https://t.co/G6l8wxgZVT
The 100-token input padding is gone.
Short replies like "yes, that works" used to need filler before generation.
Now they don't, so voice agents return first audio much faster on the short turns that fill real conversation.