It turns out that getting rid of all the breaths and laughs, and trimming the silences and replacing them with a fixed duration of silence, helps to train a better DCTTS model. The audio source I used was from a video game, and even though the original transcription is pretty accurate, it sometimes misses a thing or two, and some clips contain multiple sentences, which means there are long silences inside the audio. I thought it would work just fine, so I didn’t do any audio pre-processing before using the clips to train a DCTTS model. The model was working fine BUT something was just off after commas. If the text had a comma, the model would stop synthesizing the part after the comma or sometimes mispronounce words. I tried many things, including increasing the dimension of the hidden units (which in the end does help with longer sentences), the learning rate, and the batch size. Well, it turns out you just need to give it more regularized/normalized (not sure what the right term is) data.
A little summary in advance: I basically used the Librosa library, specifically librosa.effects.trim() and librosa.effects.split(), to achieve this. BUT some pre-processing needs to be done as well so that those two functions can recognize the voiced parts and the silences better: I apply pre-emphasis to the audio before trimming and splitting, just like the code in DCTTS does to bring out the vocal. Librosa also has a built-in pre-emphasis function, but with it the beginning of the audio sounds weird compared to the pre-emphasis code from the DCTTS repo. In the end, I found 0.80 to be a pretty good choice. The detailed values can be seen in the GitHub repo.
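To make that concrete, here is a minimal sketch of the trim-and-split step. It is not the exact code from the repo: the function name find_voiced_intervals and the top_db value are placeholders I made up, and the 0.80 pre-emphasis coefficient is simply the value mentioned above (the detailed parameters are in the GitHub repo).

```python
import numpy as np
import librosa

def preemphasis(y, coeff=0.80):
    # First-order pre-emphasis filter, like in the DCTTS code: y[t] - coeff * y[t-1].
    # 0.80 is the coefficient mentioned in the text; the exact values are in the repo.
    return np.append(y[0], y[1:] - coeff * y[:-1])

def find_voiced_intervals(path, top_db=30):
    # top_db here is just an example threshold, not the one used in the repo.
    y, sr = librosa.load(path, sr=None)

    # Trim/split on the pre-emphasized signal (it makes the vocal stand out),
    # but keep the original samples for the actual cutting.
    y_pre = preemphasis(y)

    # Drop leading/trailing silence first.
    _, (start, end) = librosa.effects.trim(y_pre, top_db=top_db)
    y, y_pre = y[start:end], y_pre[start:end]

    # Then split what remains into non-silent chunks.
    intervals = librosa.effects.split(y_pre, top_db=top_db)
    clips = [y[s:e] for s, e in intervals]
    return clips, sr
```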
Here are some notebooks that I created for exploring the audio:
(some cells are hidden because there are a lot of them)
(warning: some hissing sounds are amplified for some reason, so turn down the volume first)
The whole process of audio exploration and a first version of the functions: Here
A better version of the functions: Here
Even though I don’t need to transcribe each separated audio clip anymore, I had the idea of transcribing the separated clips that are relatively small (they are probably not words) as a way to intelligently eliminate the non-speech parts. So I split the audio into clips based on silence, ran the small clips through DeepSpeech to check whether they are real words or just some humming and laughs, filtered out the ones that aren’t, and then inserted a fixed duration of silence back between the remaining clips. A sketch of this step follows.
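Here is roughly how that filtering and re-joining could look. This is only a sketch under my own assumptions, not the repo’s code: the DeepSpeech model filename, the 0.5 s “suspiciously short” threshold, and the 0.3 s gap are placeholder values, and it reuses the clips/sr returned by the splitting sketch above.

```python
import numpy as np
import librosa
from deepspeech import Model

# Example model path; use whatever DeepSpeech model file you actually have.
STT = Model("deepspeech-0.9.3-models.pbmm")

def is_real_speech(clip, sr, min_seconds=0.5):
    # Only bother transcribing suspiciously short clips; longer ones are
    # assumed to be actual speech. The threshold is a made-up example.
    if len(clip) / sr >= min_seconds:
        return True
    # DeepSpeech expects 16 kHz, 16-bit mono PCM.
    clip16k = librosa.resample(clip, orig_sr=sr, target_sr=16000)
    pcm = (clip16k * 32767).astype(np.int16)
    text = STT.stt(pcm)
    return len(text.strip()) > 0  # humming/laughs usually come back empty

def rejoin(clips, sr, gap_seconds=0.3):
    # Glue the surviving clips back together with a fixed-length silence.
    gap = np.zeros(int(gap_seconds * sr), dtype=clips[0].dtype)
    out = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.append(gap)
        out.append(clip)
    return np.concatenate(out)

# Usage, continuing from the splitting sketch above:
# clips, sr = find_voiced_intervals("some_line.wav")
# cleaned = rejoin([c for c in clips if is_real_speech(c, sr)], sr)
```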
So after splitting the audio into clips, filtering out non-vocals like sighs, and gluing the rest back together, here are the results:
Before
After
Notice that the sigh after ‘but’ is gone.
The newly trained model is finally able to synthesize complex sentences without errors. This also means that I can finally put commas in the sentence to be synthesized without doing some crazy text pre-processing like splitting the text, producing a clip for each part, gluing them back together, and… tons of other steps.
And the most important reason for synthesizing longer sentences with commas is that if I actually split the sentence at every comma, the model has to synthesize more, shorter clips. Since the audio data sometimes has the character speaking to himself, speaking in a public space, and in slightly different styles (I don’t have time to listen to almost 20k audio files… yet), each piece that begins with a different word copies the style of the respective original audio clips. And it sounded extremely weird after gluing those separately synthesized clips together, like it’s not one sentence anymore. So I was persistent in solving the comma issue, and I’m glad I fixed it.