Oh silly me. I've spent a whole day trying to split the audio into smaller parts and transcribe each split clip one by one. It did work, apart from the accuracy, which matters greatly to the attention layer of DC-TTS; even a slight error in the script is bad for the model. Instead of splitting the file and then figuring out which clip begins with which part of the sentence, why didn't I just split the audio and stitch it straight back together to get rid of the extra silence? I can't believe I didn't think of that earlier, because that way the original dialog script can be used as-is.
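Roughly what I have in mind, as a quick sketch with pydub (the thresholds and file names here are just placeholders to illustrate the idea, not the exact values I used):

```python
# Split on silence, then stitch the chunks back with one fixed-length gap.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("dialog_line.wav")

# Cut wherever the clip goes quiet for a while.
chunks = split_on_silence(
    audio,
    min_silence_len=500,               # ms of quiet that counts as a pause
    silence_thresh=audio.dBFS - 16,    # relative to the clip's own loudness
    keep_silence=100,                  # keep a little breath around each chunk
)

# Rejoin everything with the same short gap, so every pause between
# sentences ends up the same length instead of however long it was recorded.
gap = AudioSegment.silent(duration=300)
normalized = chunks[0]
for chunk in chunks[1:]:
    normalized += gap + chunk

normalized.export("dialog_line_normalized.wav", format="wav")
```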
The DeepSpeech transcription part is still useful, because it tells me whether a chunk of audio doesn't sound like English or is just a pause. That way, I can properly normalize the gaps between sentences so that the commas actually make sense! I should have just carried on with the deepstory main web app. Oops.
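Something like this is what I mean by using DeepSpeech as a sanity check on each chunk, again just a sketch assuming the deepspeech 0.9.x pip package and its released model files (names and thresholds are placeholders):

```python
import numpy as np
import deepspeech
from pydub import AudioSegment
from pydub.silence import split_on_silence

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe(chunk: AudioSegment) -> str:
    # DeepSpeech expects 16 kHz, 16-bit mono PCM.
    pcm = chunk.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    return model.stt(np.frombuffer(pcm.raw_data, dtype=np.int16))

audio = AudioSegment.from_wav("dialog_line.wav")
chunks = split_on_silence(audio, min_silence_len=500,
                          silence_thresh=audio.dBFS - 16, keep_silence=100)

kept = []
for i, chunk in enumerate(chunks):
    text = transcribe(chunk).strip()
    if not text:
        # Empty transcription: probably a pause or non-speech noise, drop it
        # before stitching the chunks back together.
        print(f"chunk {i}: no recognizable English, skipping")
        continue
    kept.append(chunk)
```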