Why is voice so hard with AI?
With OpenAI's promise to launch voice and the recent excitement around Moshi (an open voice AI model), it seems we may be getting closer to solving voice AI.
If you haven't heard of Moshi, you can check it out here: https://moshi.chat/.
It's the latest entry in the race to solve AI voice conversations. At RAIA (https://raiabot.com), we have been experimenting with Voxia and Bland.ai for different solutions on our platform. Here is a breakdown of why voice AI is so challenging and what the current solutions offer for deploying it.
LLMs are Slow
As you can imagine, voice AI needs to be fast to avoid awkward delays during conversations. Unlike SMS, email, or live chat, where a delay can make the AI feel more human, delays in voice interactions have the opposite effect. On average, it takes about 3 to 5 seconds for an LLM to process a request and deliver an answer. Even though new models are improving in speed, you still have to account for the processing time of Speech-to-Text (STT) and Text-to-Speech (TTS), plus the delivery lag over the phone network (typically handled by Twilio or a similar service). Some "killer demos" of voice AI stream audio through a browser or app to appear faster, but this is not a realistic architecture for deploying a worthwhile application. For business applications, the AI needs to handle actual phone calls effectively.
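To make this concrete, here is a rough back-of-envelope latency budget for a single conversational turn over the phone. All numbers are illustrative assumptions, not measurements from any particular provider:

```python
# Rough end-to-end latency budget for one turn of a phone-based voice AI.
# Every number here is an illustrative assumption, not a measurement.
latency_ms = {
    "telephony_inbound": 150,   # caller audio reaching the server (e.g., via Twilio)
    "stt": 500,                 # speech-to-text transcription
    "llm": 3000,                # LLM generating a full response (the 3-5s range above)
    "tts": 700,                 # text-to-speech synthesis
    "telephony_outbound": 150,  # synthesized audio playing back to the caller
}

total_s = sum(latency_ms.values()) / 1000
print(f"Estimated turn latency: {total_s:.1f}s")  # ~4.5s with these assumptions
```

Even with a fast LLM, the surrounding pipeline can easily add a second or more on its own.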
The speed challenge will likely be addressed through a few approaches: smaller language models (SLMs), native support on chips and devices, and faster inference. In the meantime, most providers rely on pre-recorded portions of the conversation to "fake" speed, which can work in limited use cases.
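Another way to cut perceived latency is to stream the model's output and start synthesizing speech as soon as the first sentence is complete, rather than waiting for the full answer. Here is a minimal sketch of that idea; stream_llm_tokens and synthesize_and_play are hypothetical stand-ins for your LLM and TTS clients:

```python
# Sketch: stream LLM tokens and synthesize speech sentence by sentence,
# so the caller hears the first sentence before the full answer exists.
# stream_llm_tokens() and synthesize_and_play() are hypothetical stand-ins.

SENTENCE_ENDINGS = (".", "!", "?")

def stream_response(prompt, stream_llm_tokens, synthesize_and_play):
    buffer = ""
    for token in stream_llm_tokens(prompt):    # yields text chunks as generated
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            synthesize_and_play(buffer)        # start TTS on the finished sentence
            buffer = ""
    if buffer.strip():
        synthesize_and_play(buffer)            # flush any trailing partial sentence
```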
Handling Interrupts
The biggest weakness of AI is its inability to engage in a truly "human" conversation. Today, AI operates like an old-fashioned CB radio: it processes your statement or question once you hit send, much like saying "OVER" on a handheld radio. Obviously, with voice there is no send button, so the AI doesn't always know when to respond or even when to listen. This problem was initially "solved" with a wake word like "Hey Siri" or "Alexa," after which the AI would respond; it would decide you were done speaking by waiting for an extended period of silence.
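In code, that silence-based approach to ending a turn looks roughly like this. is_speech is a hypothetical voice-activity-detection (VAD) check on a single audio frame; in practice, a library such as webrtcvad fills that role:

```python
import time

# Sketch: naive end-of-turn detection by waiting for sustained silence.
SILENCE_TIMEOUT = 1.5  # seconds of silence before we assume the user is done

def wait_for_end_of_turn(audio_frames, is_speech):
    last_speech = time.monotonic()
    for frame in audio_frames:                 # e.g., 20 ms chunks from the call
        if is_speech(frame):                   # hypothetical VAD check
            last_speech = time.monotonic()
        elif time.monotonic() - last_speech > SILENCE_TIMEOUT:
            return                             # user has likely finished; respond now
```

The obvious flaw is the fixed timeout: set it too short and the AI cuts people off mid-thought; too long and every reply feels laggy.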
However, real conversations are more fluid and dynamic, with pauses lasting anywhere from one to ten seconds. In many cases, people talk over each other, usually to add on or agree. These variations in pauses and interruptions wreak havoc on AI. LLMs are not wired to understand the nuances of language and communication; they simply generate the best possible response.
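Handling an interruption (often called "barge-in") means continuing to monitor the caller's audio while the AI is speaking and cancelling playback the moment speech is detected. A minimal sketch, assuming hypothetical playback and is_speech interfaces:

```python
# Sketch: barge-in handling. While TTS audio is playing, keep watching the
# caller's audio; if the caller starts speaking, stop playback and listen.
# playback and is_speech() are hypothetical interfaces, not a real library.

def play_with_barge_in(playback, incoming_frames, is_speech):
    for frame in incoming_frames:      # caller audio captured during playback
        if not playback.is_playing():
            return False               # playback finished uninterrupted
        if is_speech(frame):
            playback.stop()            # caller interrupted: cancel the TTS audio
            return True                # hand the turn back to the caller
    return False
```

Even this simple version misfires on back-channel sounds like "uh-huh" or "right," which a human speaker would talk straight through.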
We may not truly experience AI engaging in human-like conversations until AGI becomes a reality. Until then, voice applications may have realistic voices and answer questions dynamically, but they will easily get confused by over-talk and interruptions.
Implementation and Cost
Another reason AI voice is difficult is the current set of options for deploying a complete solution. For consumer apps, companies like Apple and Google will integrate AI natively into their phones. For business applications, however, multiple components need to be connected. We built this natively into RAIA by connecting three different platforms.
The first piece is STT (Speech-to-Text). STT has been around for a long time and has become quite good; you can experience it when dictating a text message on your phone. Since most business apps use Twilio or similar services, STT is already embedded in their platforms. The next piece is TTS (Text-to-Speech). This is also usually available in voice platforms, but the speech often sounds robotic and lacks the human inflections businesses want. Companies like Google, Apple, and ElevenLabs are working on more human-like AI voice models. The final piece is the actual voice platform that connects phone calls to your app (e.g., Twilio). As you can see, there is no end-to-end solution; making it work for business apps requires integration and a bit of magic.
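To show how the pieces connect, here is a minimal sketch of that loop using Flask and Twilio's Python helper library. Twilio's built-in STT (Gather with speech input) and TTS (Say) handle the speech on both ends, while ask_llm is a hypothetical stand-in for the LLM call:

```python
from flask import Flask, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

def ask_llm(question: str) -> str:
    # Hypothetical stand-in for the LLM request (and its 3-5 second delay).
    return "Thanks for calling. Here is my answer..."

@app.route("/voice", methods=["POST"])
def voice():
    # Twilio transcribes the caller's speech (STT) and posts it to /respond.
    resp = VoiceResponse()
    gather = Gather(input="speech", action="/respond", speech_timeout="auto")
    gather.say("How can I help you today?")
    resp.append(gather)
    return str(resp)

@app.route("/respond", methods=["POST"])
def respond():
    question = request.form.get("SpeechResult", "")
    resp = VoiceResponse()
    resp.say(ask_llm(question))   # Twilio's built-in TTS speaks the reply
    resp.redirect("/voice")       # loop back and listen for the next question
    return str(resp)
```

Swapping in a more natural-sounding TTS voice (e.g., ElevenLabs) or a custom STT model means adding yet another vendor and another hop of latency to this loop.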
Since it is a complex solution requiring multiple vendors (especially for the voice models), the cost is driven up. Most solution providers charge per minute because the underlying processing costs are variable and differ from provider to provider.
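A simple way to reason about per-minute pricing is to stack the components. The rates below are illustrative assumptions for the sketch, not quoted vendor pricing:

```python
# Illustrative cost stack for one minute of AI phone conversation.
# All rates are assumptions, not actual vendor prices.
per_minute_usd = {
    "telephony": 0.014,  # inbound call minutes
    "stt": 0.02,         # speech-to-text
    "llm": 0.03,         # LLM tokens consumed per minute of dialogue
    "tts": 0.04,         # text-to-speech synthesis
}

cost = sum(per_minute_usd.values())
print(f"~${cost:.3f}/min, or ~${cost * 60:.2f} per hour of calls")
```

Even at roughly ten cents a minute, a business handling thousands of call hours a month has to weigh that against the cost of human agents.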
It is clear that consumer apps and devices will be the first to deploy reasonably good AI voice solutions, since they can eliminate many of the delay issues and boost performance natively on the device. For business apps, it may take some time for providers to reduce delays, handle interruptions, and make it affordable enough for companies to seriously consider replacing their offshore call centers.
That said, Moshi's responses were almost unnaturally quick. It's crazy how far voice AI has come, and how much further it will go.