@anarchyco@livekit@pipecat_ai@TenFramework Small corrections about Pipecat:
- We are https://t.co/6x1oxDsT5E not https://t.co/4ZeQ3ZurAK
- The "many pre-built components" link should go: https://t.co/BayDTklqoo
@anarchyco@livekit@pipecat_ai@TenFramework Nice write up! It's very exciting to see orchestration pipelines and voice AI, in general, get some attention. 2025 will be a very big year!
Better/faster/cheaper voice AI turn detection with Gemini 2.0
The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent.
The technical terms for this job are "turn detection" or "phrase endpointing."
If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating.
There are a number of ways to approach this. You can:
1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses.
2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics.
3. Train a specialized phrase endpointing model that operates directly on audio.
4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0.
We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches.
Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model.
Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support).
You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.
@trydaily and @Google have partnered to help you build real-time AI agents with the new Gemini Multimodal Live API. Really proud of the team working on @pipecat_ai!
Get started here: https://t.co/nSByfKrNUw.
Gemini 2.0 launched today. Amazing multimodal capabilities, long context windows, fast response times, built-in tools, and top-of-the-leaderboards reasoning capabilities.
Plus a new API — the Multimodal Live API — for conversational AI applications, like voice agents and multimodal copilots.
@Google and Daily have partnered to build Multimodal Live API support into the @pipecat_ai Open Source SDKs for Web, Android, iOS and C++.
The Pipecat SDKs come with echo cancellation and noise reduction, device management, event abstractions, React hooks, and more. They support both direct connections to the Gemini WebSocket API, and WebRTC routing on Daily's global ultra-low latency network.
Build realtime voice agents with Gemini, Pipecat, and Daily.
Links to docs and starter kits in the thread below (1/4)...
We have achieved so much in @pipecat_ai! Thank you all! The community is amazing and keeps growing (1445 on Discord)! Pretty sure it's probably the most complete and powerful conversational AI orchestration framework out there (and this list is even missing a few things!):
Talk to (a bootleg) virtual @benthompson
[Meta-note: I recorded this video in a Waymo. So you're watching an AI experience inside an AI experience.]
We did an internal voice AI hackathon a couple of weeks ago at @trydaily. Several of us are long-time @stratechery fans; @mark_backman had the idea of creating a "talk to Ben Thompson" toy demo.
This kind of project is a really nice testbed for combining RAG with voice.
I'll put some notes about building voice + RAG below, but if you just want to jump to a live demo, there's a link further down in this thread.
The tech stack here breaks down into two parts: preparing and indexing the data, and running the live experience. There are lots and lots of choices right now for chunking, embedding, and storage/retrieval tooling. Mark used these:
- @spacy_io for semantic chunking
- @OpenAI text-embedding-3-small
- @pinecone to store the embeddings
The live app uses:
- @OpenAI gpt-4-o mini
- function calls trigger a @langchain query
- the voice is a @cartesia clone
- @pipecat_ai does the low latency phrase endpointing, interruption handling, context management, and orchestration
- @trydaily Daily Bots voice transport and Pipecat hosting
- the demo app is hosted on @vercel
A link to the full source code of the app (but not the copyrighted Stratechery content) is in the thread below.
Several things about building this are tricky:
- Latency really matters, and it's hard to make function calling + RAG fast. This was an experiment, not a production app, but Mark was still able to get the median total (voice-to-voice) response time down below 1.5s.
In general, we aim for ~800ms for conversational voice AI response times, so this is slower than we want these experiences to be. But the median here isn't terrible. The outliers do feel too slow, though.
- RAG is complicated to get right. Mark did a lot of experimenting with chunking and embeddings. I think this definitely clears the 80/20 bar of being an interesting demo. I'm interested in what you think if you try it!
For a production app, we'd want to do significantly more work on the retrieval subsystem. The quality of the data fetch heavily influences the quality of the conversational output.
I'm convinced that talking to an LLM "personality" is going to be a very, very common thing in the near future.
Sometimes, we'll talk to personalities that are slices of real peoples' public personals. Like this one.
I also think there will be hugely popular personas that are "natively AI," personalities that are not based on a specific, real person.
These new apps pose interesting, interrelated questions about copyright, user expectations and desires, and UI design.
We trained this app on copyrighted Stratechery content and cloned a real person's voice.
This is clearly a copyright violation and of course we'll take the demo down if Ben Thompson objects to it being publicly available. Note that it's not possible to retrieve the copyrighted material, here. It's only possible to get GPT-4o-mini's "remix" of the content.
We needed a "behind a paywall" corpus of content to build an interesting RAG demo, because today's large LLMs are trained on much of the freely accessible information on the Internet.
There are two things to note about that:
1. You can build a decent "clone" of a public person's personality just by creatively prompting a state-of-the-art LLM. You won't necessarily get the specific content grounding you probably want, but writing style usually shines through pretty nicely.
2. Almost all of the content used for training state-of-the-art LLMs is copyrighted content, even when it's not behind a paywall. Courts haven't yet ruled on whether mixing *a lot* of copyrighted content from many sources together constitutes a legal use of copyrighted content. Perhaps this falls under the category of "fair use." Perhaps not.
I went in person to see the Eldred vs Ashcroft oral argument at the Supreme Court in 2002. The court's decision in that case upheld the Digital Millenium Copyright Act. That felt momentous at the time — and wrongly decided.
It seems certain that there will be an even more momentous case about how copyright law applies to large model training.
Perhaps our highly polarized congress will find a way to pass new laws that extend and clarify copyright for this new era. If so, we should hope that corporate lobbyists aren't the primary authors of that law, as they were with the DMCA.
We did not use any of the Stratechery podcast content for this demo, because adding multi-modal, multi-person content was beyond the scope of a hackathon project. But it sure seems like you'd want to add all of that great audio source material to a bigger, production-quality, authorized version of an app like this.
It's less obvious to me whether you would want to try to add in non-Stratechery content from Ben Thompson. (NBA commentary and analysis!) Thompson maintains a "no-tech" X account — @NoTechBen. Should this content separation that makes sense on X also port over to the new generative AI personality world?
Anyway, this is now a very long post ... so go play with the demo if you're so inclined. Link in the next tweet.
Talk to (a bootleg) virtual @benthompson
[Meta-note: I recorded this video in a Waymo. So you're watching an AI experience inside an AI experience.]
We did an internal voice AI hackathon a couple of weeks ago at @trydaily. Several of us are long-time @stratechery fans; @mark_backman had the idea of creating a "talk to Ben Thompson" toy demo.
This kind of project is a really nice testbed for combining RAG with voice.
I'll put some notes about building voice + RAG below, but if you just want to jump to a live demo, there's a link further down in this thread.
The tech stack here breaks down into two parts: preparing and indexing the data, and running the live experience. There are lots and lots of choices right now for chunking, embedding, and storage/retrieval tooling. Mark used these:
- @spacy_io for semantic chunking
- @OpenAI text-embedding-3-small
- @pinecone to store the embeddings
The live app uses:
- @OpenAI gpt-4-o mini
- function calls trigger a @langchain query
- the voice is a @cartesia clone
- @pipecat_ai does the low latency phrase endpointing, interruption handling, context management, and orchestration
- @trydaily Daily Bots voice transport and Pipecat hosting
- the demo app is hosted on @vercel
A link to the full source code of the app (but not the copyrighted Stratechery content) is in the thread below.
Several things about building this are tricky:
- Latency really matters, and it's hard to make function calling + RAG fast. This was an experiment, not a production app, but Mark was still able to get the median total (voice-to-voice) response time down below 1.5s.
In general, we aim for ~800ms for conversational voice AI response times, so this is slower than we want these experiences to be. But the median here isn't terrible. The outliers do feel too slow, though.
- RAG is complicated to get right. Mark did a lot of experimenting with chunking and embeddings. I think this definitely clears the 80/20 bar of being an interesting demo. I'm interested in what you think if you try it!
For a production app, we'd want to do significantly more work on the retrieval subsystem. The quality of the data fetch heavily influences the quality of the conversational output.
I'm convinced that talking to an LLM "personality" is going to be a very, very common thing in the near future.
Sometimes, we'll talk to personalities that are slices of real peoples' public personals. Like this one.
I also think there will be hugely popular personas that are "natively AI," personalities that are not based on a specific, real person.
These new apps pose interesting, interrelated questions about copyright, user expectations and desires, and UI design.
We trained this app on copyrighted Stratechery content and cloned a real person's voice.
This is clearly a copyright violation and of course we'll take the demo down if Ben Thompson objects to it being publicly available. Note that it's not possible to retrieve the copyrighted material, here. It's only possible to get GPT-4o-mini's "remix" of the content.
We needed a "behind a paywall" corpus of content to build an interesting RAG demo, because today's large LLMs are trained on much of the freely accessible information on the Internet.
There are two things to note about that:
1. You can build a decent "clone" of a public person's personality just by creatively prompting a state-of-the-art LLM. You won't necessarily get the specific content grounding you probably want, but writing style usually shines through pretty nicely.
2. Almost all of the content used for training state-of-the-art LLMs is copyrighted content, even when it's not behind a paywall. Courts haven't yet ruled on whether mixing *a lot* of copyrighted content from many sources together constitutes a legal use of copyrighted content. Perhaps this falls under the category of "fair use." Perhaps not.
I went in person to see the Eldred vs Ashcroft oral argument at the Supreme Court in 2002. The court's decision in that case upheld the Digital Millenium Copyright Act. That felt momentous at the time — and wrongly decided.
It seems certain that there will be an even more momentous case about how copyright law applies to large model training.
Perhaps our highly polarized congress will find a way to pass new laws that extend and clarify copyright for this new era. If so, we should hope that corporate lobbyists aren't the primary authors of that law, as they were with the DMCA.
We did not use any of the Stratechery podcast content for this demo, because adding multi-modal, multi-person content was beyond the scope of a hackathon project. But it sure seems like you'd want to add all of that great audio source material to a bigger, production-quality, authorized version of an app like this.
It's less obvious to me whether you would want to try to add in non-Stratechery content from Ben Thompson. (NBA commentary and analysis!) Thompson maintains a "no-tech" X account — @NoTechBen. Should this content separation that makes sense on X also port over to the new generative AI personality world?
Anyway, this is now a very long post ... so go play with the demo if you're so inclined. Link in the next tweet.
The old man skateboard meme guy is my favourite Daily Bots character. He's like, totally rad, you know?
- Try it out: https://t.co/LuGjEcvoC0
- Source code: https://t.co/3kNewlEtb6
(Please upvote on Product Hunt if you have some fun https://t.co/OflvXbnqI8)
The old man skateboard meme guy is my favourite Daily Bots character. He's like, totally rad, you know?
- Try it out: https://t.co/LuGjEcvoC0
- Source code: https://t.co/3kNewlEtb6
(Please upvote on Product Hunt if you have some fun https://t.co/OflvXbnqI8)
We are proud to collaborate with @trydaily on real-time voice #AI. Check out enterprise voice workflows, such as this healthcare patient intake demo running on #Llama 3.1 405B by @AIatMeta! https://t.co/AXz3MWFONX
#VoiceAI#Meta#Inference#LLM#Llama3
My team has been working on a fun holiday project—Talk to Santa Cat. Santa Cat is a kid-friendly, AI-powered virtual character who is designed to have Christmas themed conversations with people of all ages, especially kids.
https://t.co/dfNIHz0si3
I'm happy and grateful to announce @trydaily has raised a $40M Series B led by @RenegadePtnrs. 🎉 @imthemusic is joining our board to work with us on the future of video, audio, and WebRTC.
https://t.co/cbGxQPlFa5
📣 We’re excited to release color theming, our first major customization option for Daily Prebuilt, our embeddable #WebRTC call interface.
Developers can now customize the colors of the video call UI, including the background, text, and icons.
Read on ⏬
🎶 Do you wanna build a frontend? C‘mon let’s go & code! We’ll work together remotely, and then you’ll see, it’s a happy place to be! We’re gonna be best buddies, but now we’re not, cause I don’t know you yet. Do you wanna build a frontend? Or are you more into fullstack? Ok bye