I'm currently developing a voice agent and I want to ensure that I take the right approach before potentially over-engineering the solution. My objective is to create an agent that can handle inbound and outbound phone calls, engaging in natural conversations in English, Arabic, and Spanish. I aim to utilize Azure Neural TTS to provide realistic voice output. During conversations, the agent needs to gather essential details like the patient's name, appointment date, and reason for the visit, confirm the booking, and then save all this information in Cosmos DB.
At this point, I'm considering using Azure Communication Services or Twilio for handling telephony, Azure Speech Services for converting speech to text and vice versa, and Azure OpenAI (GPT-4/4o-mini) for conversational intelligence and extracting key information. I'll also use Cosmos DB for session management and Azure Functions for backend orchestration. Any tips, experiences, or references to similar projects would be greatly appreciated! Thanks!
1 Answer
I think you should consider looking into real-time voice APIs like gpt-realtime. They are built for lower latency and might offer what you need without the extra complexity. It runs on a 4o model but is optimized for real-time responses, which could enhance user experience significantly.
Definitely looking into it, thanks!
I'm interested in that too! But how would it fit into the Azure setup I'm planning?