I've been experimenting with the OpenAI Whisper API for speech-to-text, but I've run into a problem: when I send empty audio, the API often returns random words, frequently in Chinese. Is there any way to get a confidence score or a similar metric so I can filter out these low-confidence responses?
1 Answer
You won't be able to get a confidence score; the transcription models don't work that way. They're designed for actual speech, so blank audio tends to produce hallucinated text. What you really need to do is segment your audio beforehand and only send the portions that actually contain speech, without long pauses. I built an app this weekend that summarizes police scanner calls by chunking audio from MP3s; before I added segmentation, I ended up with nonsense transcripts like "Thank you Thank you Thank you..." repeating endlessly.
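To make the segmentation idea concrete, here's a minimal energy-based sketch (not the answerer's actual code): frame the signal, treat frames whose RMS level crosses a threshold as speech, and merge adjacent speech frames into chunks. The function name, frame size, and threshold are all illustrative assumptions; a real pipeline would slice the audio at these sample ranges and send only those chunks to the API.

```python
import math

# Hypothetical energy-based splitter: returns (start_sample, end_sample)
# spans that contain speech, skipping silent stretches entirely.
def split_on_silence(samples, frame_size=160, rms_threshold=0.01):
    chunks = []
    start = None  # start index of the current speech span, if any
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= rms_threshold:
            if start is None:
                start = i                     # speech begins here
        elif start is not None:
            chunks.append((start, i))         # speech ended at this frame
            start = None
    if start is not None:
        chunks.append((start, len(samples)))  # speech ran to the end
    return chunks

# 1 s of silence, 1 s of "speech", 1 s of silence at a nominal 16 kHz rate
audio = [0.0] * 16000 + [0.5] * 16000 + [0.0] * 16000
print(split_on_silence(audio))  # → [(16000, 32000)]
```

A plain RMS gate like this is crude compared to a real voice-activity detector, but it's often enough to stop the API from seeing long runs of silence.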

Thanks, that's super helpful! Quick question: what's the ideal length for those audio chunks? Can it be anything, as long as there are no big gaps? Also, I'm using Expo Audio to meter levels in real time, but I'm struggling with timing. My plan is to buffer and save audio whenever there's a pause, but if I wait for sound before I start recording, I might miss the start. Have you run into voice-detection issues like this before?
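One common way to avoid clipping the onset, described in the comment above, is a pre-roll buffer: keep the last few frames at all times, and when the meter finally crosses the speech threshold, flush that history into the recording first. The class and parameter names below are hypothetical, and the sketch is language-agnostic in spirit even though it's written in Python rather than against the Expo Audio API:

```python
from collections import deque

# Hypothetical pre-roll recorder: always retains the most recent N frames,
# so the audio just *before* the trigger isn't lost when recording starts.
class PreRollRecorder:
    def __init__(self, preroll_frames=5, threshold=0.01):
        self.preroll = deque(maxlen=preroll_frames)  # rolling pre-trigger history
        self.threshold = threshold
        self.recording = False
        self.captured = []

    def feed(self, frame, level):
        if self.recording:
            self.captured.append(frame)
        elif level >= self.threshold:
            # Triggered: flush the pre-roll first so the onset is kept.
            self.captured.extend(self.preroll)
            self.captured.append(frame)
            self.preroll.clear()
            self.recording = True
        else:
            self.preroll.append(frame)

rec = PreRollRecorder(preroll_frames=2)
rec.feed("f1", 0.0)
rec.feed("f2", 0.0)
rec.feed("f3", 0.0)
rec.feed("f4", 0.5)            # first frame over the threshold
print(rec.captured)            # → ['f2', 'f3', 'f4']
```

Because the deque has a fixed `maxlen`, memory stays bounded no matter how long the silence before speech lasts; only the frames immediately preceding the trigger survive into the capture.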