Most voice pipelines still look like this:
audio → transcribe → text → model → act
The problem is step one. The second you turn audio into text, you throw away tone, hesitation, sarcasm, stress. The signal that told you what the person actually meant.
So everything you build after that runs on the transcript. The actual conversation is already gone.
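A toy sketch of that loss, not anyone's real pipeline: `AudioClip`, its prosody fields, and `transcribe` are hypothetical names made up for illustration. Two clips with the same words but very different delivery collapse into one identical transcript.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Hypothetical stand-in for raw audio: the words plus how they were said."""
    words: str
    pitch_variation: float  # assumed prosody feature, 0.0 (flat) to 1.0 (agitated)
    pause_ms: int           # assumed hesitation before speaking


def transcribe(clip: AudioClip) -> str:
    # A transcript keeps the words and drops everything else.
    return clip.words


calm = AudioClip("fine, whatever works", pitch_variation=0.1, pause_ms=50)
upset = AudioClip("fine, whatever works", pitch_variation=0.9, pause_ms=1200)

# The clips differ, but the downstream model only ever sees the text.
clips_differ = calm != upset
same_transcript = transcribe(calm) == transcribe(upset)
print(clips_differ, same_transcript)
```

Anything downstream of `transcribe` cannot tell these two speakers apart, however good the model is.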
@modulate_ai trained Velma on 550M+ hours of raw audio to skip that step entirely. One model that listens to the audio instead of reading a summary of it.
#1 on the conversation understanding benchmark. 10x cheaper than running the audio through an LLM.
The part that makes this more interesting: Velma has already been running in production inside Call of Duty, GTA Online, and Fortune 500 contact centers. Now the API is open to everyone.
Hot take: most voice AI isn't actually understanding speech.
It's reading a transcript of it.
There's a meaningful difference. And it's why voice pipelines keep failing at the moments that matter most.
Dropping something tomorrow that takes a very different approach 👀
You've spent hours tuning your voice pipeline.
Better STT model. Cleaner NLP. More labels.
And it still misses the call where a customer was clearly about to churn.
The problem isn't your implementation. It's the architecture.
Something different is coming @modulate_ai 👀
95% of enterprise AI deployments fail.
That’s not a tooling problem. It’s a design problem.
In this clip, our CEO @mpappas74 breaks down why most AI products never make it past the demo stage - and what businesses actually need instead.
Not another “AI employee” to manage.
Tools that fit into real workflows and solve specific problems.
At @modulate_ai, that’s the lens we build through.
There’s an AI for transcription.
🦾 https://t.co/Fao1vzC2uA
if you’re building with voice, details matter:
Grok STT: $0.10/hr
→ transcripts only
@Modulate’s Velma: $0.03/hr
→ 14.9% WER
→ emotion detection
→ accent detection
→ PII redaction
Test it yourself...👇
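For a rough sense of scale, here is a back-of-the-envelope comparison using only the two hourly rates quoted above. The monthly audio volume is an assumption picked for illustration, and the helper name is hypothetical.

```python
# Hourly rates as quoted in the post above (USD per hour of audio).
GROK_STT_PER_HOUR = 0.10
VELMA_PER_HOUR = 0.03


def monthly_cost(rate_per_hour: float, audio_hours: float) -> float:
    """Cost in USD for a given volume of audio, rounded to cents."""
    return round(rate_per_hour * audio_hours, 2)


# Assumption: 10,000 hours of audio per month. Swap in your own volume.
audio_hours = 10_000

grok_cost = monthly_cost(GROK_STT_PER_HOUR, audio_hours)
velma_cost = monthly_cost(VELMA_PER_HOUR, audio_hours)
savings = round(grok_cost - velma_cost, 2)

print(f"Grok STT: ${grok_cost:,.2f}")
print(f"Velma:    ${velma_cost:,.2f}")
print(f"Savings:  ${savings:,.2f}")
```

This compares list prices only. It says nothing about accuracy or feature differences, which you would need to test on your own audio.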
Voice fraud isn’t just a security problem.
It’s a massive, ongoing cost center.
In this video, I break down what it’s *actually* costing businesses today - and it’s more than most people realize 👀
There are two layers to it:
1. Direct losses $$$
When voice fraud hits, the damage can be immediate - and in some cases, reach hundreds of millions. Often unrecoverable.
2. The cost of trying to prevent it $$$$
Even if you’re never breached, you’re still paying:
- Added authentication friction that slows down users
- Frustrated customers - and lost revenue
- Teams tied up auditing calls, running investigations, and handling compliance
All of that adds up to tens of millions in ongoing operational cost.
So the real question isn’t “what happens if we get hit?”
It’s “how much are we already spending because this risk exists?”
So here’s what I’m curious about:
Are you betting on generalist AI to handle critical workflows?
Or are you moving toward more specialized systems you can actually control and rely on?
I share what I think in the video - and why we’ve taken a different approach at @modulate_ai
“Can’t my LLM provider just solve this too?”
I hear this all the time - and I think it’s the wrong question.
Because in practice, the more general a system tries to be, the harder it is to trust for any specific task.
We’ve seen this before with software. Specialization wins when reliability matters 🧵
AI regulation is solving the wrong problem.
Right now, most policies are built around generative AI: models like ChatGPT that create content (and yes, can hallucinate).
But that’s only half the picture.
There’s another category: analytic AI.
Systems designed to understand what’s happening and return fixed, verifiable answers - no guessing, no hallucinations.
In this clip, our CEO @mpappas74 breaks down why treating both the same is a mistake - and how current regulations are unintentionally slowing down tools that don’t carry the same risks.
At Modulate, this distinction is core to how we build.
Because not all AI should be regulated like it makes things up 👀
@OpenAI this is the right direction.
but the interesting challenge in voice isn’t just adding more reasoning.
it’s reasoning while handling messy human conversation in real time:
- interruptions
- overlap
- emotion shifts
- ambiguity
that’s where voice stops being “LLM + audio” and becomes a completely different systems problem.
feels like the industry is finally converging on that.
The AI playbook says: more data, more compute, bigger models.
We don’t buy it.
At Modulate, this is how we think + how we build.
In this clip, our CEO @mpappas74 breaks down why focused data + real insight beats brute force.
We’re a team of ~40, and that approach has led to:
- Transcription models outperforming @OpenAI on accuracy
- Deepfake detection models topping the @huggingface speech arena leaderboard
Not by hoovering the internet. By using the right data.
Because better > bigger.
you’re right on the interface shift - voice + agents changes behavior.
but this only works if it’s reliable under pressure.
speaking to your computer is easy.
trusting it to execute is harder.
the moment it:
- mishears a command
- loses context mid-task
- acts on the wrong intent
it breaks the loop.
we’ve seen this with voice - sounding human is solved.
not breaking on real, messy input isn’t. that’s the gap between “cool demo” and “default way of working.”
BiDi is a big step - but it’s not the unlock people think it is.
talking while listening is table stakes for feeling human.
the hard part is doing that without breaking:
- overlap without mishearing
- reacting in real time without drifting
- actually understanding intent, not just backchanneling “yeah”
we’ve seen this: you can make it feel 100x better… and still be wrong.
voice isn’t gated by model size or personality anymore.
it’s gated by how well it holds up on messy, real conversations.
that’s the bar. and most systems still don’t clear it.
We’re @hackernoon Company of the Week
Voice AI breaks down when things get real: messy audio, emotion, overlap, intent.
So we built Velma - the first Ensemble Listening Model, trained on 550M hours of real-world audio, designed to understand speech as it actually happens (not sanitized benchmarks).
And ToxMod - real-time voice moderation that detects how something is said, not just the words.
This isn’t research. It’s deployed at scale across Fortune 500 platforms today 🌐
Voice is the hardest problem in AI. It’s also the most human.
We’re building the infrastructure to make it work - safely, accurately, in real time.