DeepSpeech and Audio Segmentation

So nothing was going as planned. But I experimented with different silence thresholds when splitting the audio and managed to find one that splits quite precisely. Naturally, I thought of using DeepSpeech and its pretrained model from Mozilla to transcribe the segments back, so I wouldn't have to listen to the audio files one by one.
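
The splitting step looked something along these lines, using pydub's split_on_silence (the paths and threshold values here are placeholders, not the exact ones that worked for me):

```python
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("recording.wav")

# These two knobs are the ones worth experimenting with: too aggressive and
# words get clipped, too lenient and neighboring sentences merge into one segment.
segments = split_on_silence(
    audio,
    min_silence_len=500,  # silence must last at least 500 ms to count as a split point
    silence_thresh=-40,   # anything quieter than -40 dBFS is treated as silence
)

Path("segments").mkdir(exist_ok=True)
for i, segment in enumerate(segments):
    segment.export(f"segments/chunk_{i:04d}.wav", format="wav")
```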

It worked. The transcription was quite good. I mean, I only needed it to recognize a few words in each sentence so that the segment could be matched back to the original script. Strangely, the transcription was sometimes missing the first word. My curiosity drove me to add silence back to the beginning to see if anything changed.
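
For reference, here is a sketch of the transcription step with DeepSpeech's Python API. The model file names are from Mozilla's 0.9.x release, and the fuzzy match against the script is just my illustration of the "a few words is enough" idea:

```python
import wave
import numpy as np
import deepspeech
from difflib import get_close_matches

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe(path):
    # The pretrained model expects 16 kHz, 16-bit mono PCM
    with wave.open(path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return model.stt(audio)

script_lines = open("script.txt").read().splitlines()
hypothesis = transcribe("segments/chunk_0000.wav")

# Even a partial transcription is usually enough to find the matching line
match = get_close_matches(hypothesis, script_lines, n=1, cutoff=0.3)
print(hypothesis, "->", match)
```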

It turns out, yes: I saw differences after adding 1 s of silence at the beginning. So to get DeepSpeech to transcribe the segments properly, I actually have to add back the silence that I trimmed? Okay, I'll do that. The thing is, it takes around twenty minutes to transcribe the 2000+ audio files. I guess I'll have to wait a bit again.
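
Padding the segments back out is simple enough with pydub again (the paths here are placeholders):

```python
from pathlib import Path
from pydub import AudioSegment

out_dir = Path("padded")
out_dir.mkdir(exist_ok=True)

for path in sorted(Path("segments").glob("*.wav")):
    segment = AudioSegment.from_wav(str(path))
    # 1 s of leading silence, matching the segment's sample rate
    silence = AudioSegment.silent(duration=1000, frame_rate=segment.frame_rate)
    (silence + segment).export(str(out_dir / path.name), format="wav")
```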