Hey everyone! I'm building an AI voice agent on the ESP32-S3 DevKit module, but I've hit a big hurdle: the costs for Text-to-Speech (TTS) and Speech-to-Text (STT) are really adding up. Currently I'm using OpenAI Whisper for STT and ElevenLabs for TTS, and I estimate I'll need about 60 minutes of usage each day, at roughly 600 characters of TTS per minute. Here's what the breakdown looks like:
- Whisper (STT): ~$0.36/hour
- ElevenLabs (TTS, Creator plan): ~$9.00/hour
- Total: around $9.36 per hour, which works out to about $280 a month (30 days) for just an hour of use each day. And that doesn't even cover cloud and infrastructure costs.
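For anyone who wants to check the arithmetic, here's a quick sketch. The hourly rates are the estimates from the list above; the 30-day month and the function name `monthly_cost` are just assumptions for illustration.

```python
# Hourly rates are the estimates quoted in the breakdown above.
WHISPER_PER_HOUR = 0.36      # STT, dollars per hour of audio
ELEVENLABS_PER_HOUR = 9.00   # TTS, dollars per hour at ~600 chars/min


def monthly_cost(hours_per_day: float = 1.0, days_per_month: int = 30) -> float:
    """Estimated monthly STT + TTS spend in dollars (30-day month assumed)."""
    hourly = WHISPER_PER_HOUR + ELEVENLABS_PER_HOUR
    return hourly * hours_per_day * days_per_month


print(f"${monthly_cost():.2f} per month")  # prints "$280.80 per month"
```

Changing `hours_per_day` or `days_per_month` shows how quickly the bill scales with usage.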
I'm curious if anyone has tips on how to cut these costs or alternative approaches I should consider!
1 Answer
First off, what are you trying to achieve? Is this a product for sale, or just for personal use? If it's for a product, what compromises can you make to save on costs? For instance, are you okay with a less realistic TTS voice or lower accuracy in speech recognition?

This is for a product to sell. I can cache common phrases and I'm fine with higher latency, but the TTS needs to sound realistic. I could fall back to a self-hosted TTS model for minor questions and reserve ElevenLabs for key queries, but mixing the two might mean the agent's voice sounds different from one reply to the next. Any ideas?
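To make the caching idea concrete, here's a minimal sketch, assuming the cache lives on the server side rather than on the ESP32. `PhraseCache` and `fake_tts` are hypothetical names; `fake_tts` just stands in for whatever paid TTS call you actually make. Audio is stored on disk under a hash of the normalized text, so a repeated phrase never hits the paid API twice.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one cache entry."""
    return " ".join(text.lower().split())


class PhraseCache:
    """Disk cache for synthesized audio, keyed by a hash of the normalized text."""

    def __init__(self, cache_dir: str, synthesize: Callable[[str], bytes]):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.synthesize = synthesize  # hypothetical: wraps the real TTS request
        self.hits = 0
        self.misses = 0

    def _path(self, text: str) -> Path:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        return self.dir / f"{key}.audio"

    def get_audio(self, text: str) -> bytes:
        path = self._path(text)
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        # Cache miss: pay for synthesis once, then store the result.
        self.misses += 1
        audio = self.synthesize(text)
        path.write_bytes(audio)
        return audio


def fake_tts(text: str) -> bytes:
    """Stand-in for a paid TTS call; returns placeholder bytes, not real audio."""
    return b"AUDIO:" + text.encode("utf-8")


cache = PhraseCache(tempfile.mkdtemp(), fake_tts)
cache.get_audio("Hello, how can I help?")
cache.get_audio("hello,   how can I help?")  # same phrase after normalization
print(cache.hits, cache.misses)  # prints "1 1"
```

Since every cached clip comes from the same premium voice, cached replies stay consistent with the live ones. The voice mismatch only shows up for whatever gets routed to the fallback model.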