The dangerous fintech voice agent bug is helpfulness.
Customer gives a card number. ASR catches most of it. Agent says, "I heard your card ends in 4532, is that correct?"
Helpful for accuracy. Bad for PCI.
https://t.co/qWxQjLWIh8
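One way to test for this class of bug is a guard on anything the agent is about to say or log. The sketch below is only an illustration, not PCI guidance and not Hamming's implementation: `redact_card_numbers` and `luhn_ok` are hypothetical helper names, and it only catches full card numbers written as digits.

```python
import re

# Hypothetical helpers for illustration; not part of any real API.
# Matches 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RUN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Mask digit runs that look like card numbers before speaking or logging."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group(0))
        return "[REDACTED CARD]" if luhn_ok(digits) else match.group(0)

    return CARD_RUN.sub(_mask, text)
```

A test suite could then assert that no agent turn changes when passed through `redact_card_numbers`, meaning the agent never echoed a full card number in the first place.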
Heading to Money20/20 Europe. Every fintech will pitch an agentic AI launch. Almost none will say how they tested it.
We wrote up what testing voice agents in financial services actually looks like.
Hamming is heading to Europe this June.
First stop: Money20/20 Amsterdam, June 1–4.
The voice AI questions we’re hearing from European teams have changed.
A year ago: “Does it work?”
Now: “How does it fail, and how will we know in production?”
Teams running voice agents at real scale across payments, banking, insurance, and enterprise CX do not want lighter QA. They want testing and monitoring infrastructure that is stricter than what they can build internally.
That is the conversation we’re excited to have in Amsterdam.
If you’ll be there, send us a DM. Coffee on us.
What breaks first at 1,000 concurrent calls?
Almost never the model. Almost always the plumbing.
Rate limits → STT streams → TTS concurrency caps → phone pool → DB locks.
Full load testing guide 👇
https://t.co/3HeUSMiEBC
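Before running a load test, it can help to write the plumbing limits down and see which one gives out first at the target concurrency. A minimal sketch, assuming made-up caps: every number and stage name in `STAGE_CAPS` is a placeholder, not a real vendor limit.

```python
# Illustrative caps only; replace with the limits from your own providers.
STAGE_CAPS = {
    "api_rate_limit": 600,
    "stt_streams": 800,
    "tts_concurrency": 900,
    "phone_pool": 1000,
    "db_write_slots": 1200,
}


def saturated_stages(concurrent_calls: int, caps: dict[str, int]) -> list[tuple[str, int]]:
    """Return the stages whose cap is below the target, lowest cap first."""
    over = [(name, cap) for name, cap in caps.items() if cap < concurrent_calls]
    return sorted(over, key=lambda item: item[1])


if __name__ == "__main__":
    for name, cap in saturated_stages(1000, STAGE_CAPS):
        print(f"{name} saturates at {cap} concurrent calls")
```

The first entry returned is the stage to watch first as the test ramps up.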
We’re hiring a GTM Engineer at Hamming. Not an SDR. Not an AE.
- You’ll do outbound, discovery, demos, closing, and expansion: basically, whatever makes customers successful.
- Voice AI is a technical sale. You’ll be on screen shares with CTOs and engineering teams.
Atypical backgrounds encouraged: ex-founders, lawyers, FDEs, chiefs of staff from small teams.
Send us a one-pager, PDF, or Google Doc. Resume optional.
1. Why Hamming?
2. Why you?
3. What would you do in your first 30 days?
4. One strong opinion on voice AI.
Voice AI breaks differently than text AI: silent failures, exhausted QA, edge cases that only show up at scale.
Three customer quotes from the last quarter:
"Not only can we have Hamming do it instead of us, we can have Hamming do it four or five times, all at once instead of having one person call and do it one time." - Tosh Toida, QA Lead, Mia
"I find talking to the agents completely socially exhausting. The platform offers all of these different personas that the team is not able to replicate." - Blake Jones, AI Engineer, Basata
Building voice agents is 70% testing.
"It's like going from manual labor to using a tractor. You can prompt an agent in 30-45 minutes, but testing takes the next 2-3 hours. Building voice agents is 70% testing. Hamming makes that 70% manageable." - Ahmad Rufai Yusuf, Forward Deployed Engineer, Bland Labs
Manual testing eats engineering days. Hamming runs them in parallel.
Read the Bland Labs case study. 👇
Three Bay Area events this week. Three different rooms tackling the same hard problem: agents that hold up in production.
SaaStr AI Annual is where the B2B SaaS buyers are. LangChain Interrupt is where the agent-debug crowd is. AI Council is where the agent-eval builders are.
If you're building voice or agentic features, find us. 👇
https://t.co/UiQ0C0rLGu · https://t.co/vwK69TxoEp · https://t.co/9by1KcvtiV
Voice agents that survive production are tested like infrastructure, not chat.
Regression suites on every change. Simulation across thousands of personas. Real-time drift monitoring. Incident response runbooks.
Production voice quality in 2026:
https://t.co/gluGQISjS8
95-96% agreement with human evaluators.
That is our automated voice agent eval accuracy across 4M+ production calls.
If your eval system disagrees with humans, you can't trust it. If it agrees but doesn't scale, you're back to manual QA. We solved both.
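For teams checking their own evaluator against human labels, here is a rough sketch of two common measures. The post does not say how the agreement figure is computed; raw percent agreement and Cohen's kappa (which corrects for chance agreement) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(auto: list[str], human: list[str]) -> float:
    """Fraction of calls where the automated label equals the human label."""
    if len(auto) != len(human) or not auto:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == h for a, h in zip(auto, human)) / len(auto)


def cohens_kappa(auto: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(auto)
    p_observed = percent_agreement(auto, human)
    auto_counts, human_counts = Counter(auto), Counter(human)
    labels = set(auto) | set(human)
    p_expected = sum(auto_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa is worth tracking alongside raw agreement when most calls pass, because two raters who both say "pass" almost every time will agree often by chance alone.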
Voice agent latency isn't one number. It is a chain.
ASR finalization, LLM TTFT, TTS TTFB, audio buffering, network jitter.
Optimizing one segment without measuring the others can make perceived latency worse.
Full breakdown:
https://t.co/5tFTSNRgE3
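The chain can be sketched as a per-turn record with one field per segment. This is a simplification that treats the segments as additive, and every number in the usage example is invented; `TurnLatency` is a hypothetical name.

```python
from dataclasses import asdict, dataclass


@dataclass
class TurnLatency:
    """Per-turn latency segments in milliseconds (assumed additive here)."""
    asr_finalization_ms: float
    llm_ttft_ms: float
    tts_ttfb_ms: float
    audio_buffering_ms: float
    network_jitter_ms: float

    def total_ms(self) -> float:
        return sum(asdict(self).values())

    def slowest_segment(self) -> tuple[str, float]:
        return max(asdict(self).items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Invented numbers: TTS TTFB improves, but buffering grows by more.
    before = TurnLatency(300, 400, 200, 100, 50)
    after = TurnLatency(300, 400, 120, 220, 50)
    print(before.total_ms(), after.total_ms())
```

In this made-up case the TTS segment looks better on its own dashboard while the caller waits longer overall, which is why the whole chain needs to be measured together.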
@twilio ConversationRelay has crossed 25M developer minutes.
Voice plumbing is solved. The bottleneck moved up the stack: STT errors at authentication, silent model updates, background noise dropping digits.
These don't show up in load tests. They show up in production.
Voice AI teams in production:
What is the failure mode you can never reproduce in dev but always see at scale?
We've seen accent edge cases, API rate limits, codec-specific TTS bugs.
What's yours?
We'll be at Stripe Sessions in SF, Apr 29-30 at Moscone West.
Voice + payments are converging.
When a voice agent takes a payment, semantic quality isn't enough. You need numeric fidelity and assertion-level verification.
DM if you're there.
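To make "numeric fidelity" concrete: a response can be semantically fine and still state the wrong amount. A minimal sketch of an assertion-level check, assuming amounts appear as dollar figures in the transcript text; `spoken_amounts` and `amount_matches` are hypothetical names, and spelled-out numbers ("twelve fifty") are out of scope here.

```python
import re
from decimal import Decimal

# Matches figures such as $1,250.00, $1250, or $ 12.50 (illustrative pattern).
AMOUNT = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{1,2})?)")


def spoken_amounts(text: str) -> list[Decimal]:
    """Extract every dollar amount in the text as an exact Decimal."""
    return [Decimal(raw.replace(",", "")) for raw in AMOUNT.findall(text)]


def amount_matches(agent_text: str, expected: Decimal) -> bool:
    """True only if the agent stated exactly one amount and it is exact."""
    return spoken_amounts(agent_text) == [expected]
```

"Your payment of $1,205.00 will post today" reads almost the same as the $1,250.00 version, but `amount_matches` with an expected value of `Decimal("1250.00")` rejects it.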