Tweaking the TTS model for increased precision

So I’ve been having some trouble training the TTS model. No matter what batch size I tried (usually 32 or 64, and I’m not sure it even made a difference) or how I set the learning rate (0.005 is the default, so I lowered it to 0.001 or even further), the validation curve always plateaus after around 80K steps, and sometimes it even shows a slight trend towards overfitting. Still, the output sounds normal; the models all perform similarly, with maybe a word or two being pronounced more clearly (I’m quite sure that can be addressed by using another vocoder). The biggest problem is that it doesn’t perform well on long sentences; even a single comma can ‘kill’ the whole sentence.
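
Just to make the sweep concrete, the knobs I was turning amount to something like this (a minimal sketch; the `TrainConfig` names are mine, not the repo’s, and none of these combinations moved the plateau):

```python
# Hypothetical summary of the training runs I tried; field names are
# illustrative, not the actual config keys of the repo I'm using.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int   # tried 32 and 64; no noticeable difference
    lr: float         # 0.005 is the repo default

runs = [
    TrainConfig(batch_size=32, lr=0.005),  # default
    TrainConfig(batch_size=64, lr=0.005),
    TrainConfig(batch_size=32, lr=0.001),  # lowered learning rate
    TrainConfig(batch_size=64, lr=0.001),
]
# In every run the validation curve flattened around 80k steps.
```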

So I tried something else. Since the hidden dimension of the SSRN in the repo I’m using had already been increased by 128, I decided to do the same to the Text2Mel hidden dimension: I increased it from 256, first by 128, then by 256. With the extra 128 units, the problem of skipping words after a comma improved a lot, and the robotic sound decreased. This is quite surprising, because I don’t really know what the change does internally. My simple guess is that with more hidden units, the model can cope with a larger vocabulary (I added commas and exclamation marks to the character set). Of course, the model size doubled, and so did the time required for synthesis.
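
In code, the change is literally a couple of constants (a sketch assuming a DCTTS-style hyperparams file; `d` and `c` are the usual names for the Text2Mel and SSRN hidden sizes in such repos, and the baseline values are the common DCTTS defaults, so yours may differ):

```python
# Hypothetical hyperparams.py, DCTTS-style; baseline values assumed
# from the common defaults, not taken from the actual repo.
e = 128        # text embedding size (unchanged)
d = 256 + 256  # Text2Mel hidden units: tried 256+128, then 256+256
c = 512 + 128  # SSRN hidden units: already raised by 128 in the repo

# Extended character set: punctuation added so commas and exclamation
# marks are modelled instead of derailing long sentences.
vocab = "PE abcdefghijklmnopqrstuvwxyz'.?,!"  # P: padding, E: end-of-text
```

Adding `,!` to that `vocab` string is all the ‘larger vocabulary’ really amounts to.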

Well, the worst part is definitely evaluating more than 60 checkpoints (I checked everything from 30k to 90k steps), each with 40 audio files to listen to. So today I spent five hours listening to the differences and selecting the best model. Oh yes… another day of tweaking and exploring the TTS model. After all, it is the core of my project; I would strive for audio quality over video quality in this case, since there is little I can actually do about the latter.
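
At least the generation half of that chore is scriptable (the listening, sadly, isn’t). Something like this (a sketch; `load_model` and `synthesize` are placeholders for whatever the repo actually exposes, not real functions from it):

```python
import os

# Hypothetical stand-ins for the repo's real restore/synthesis calls.
def load_model(step):
    # e.g. restore the Text2Mel/SSRN weights saved at this training step
    return step  # placeholder

def synthesize(model, text, out_path):
    # e.g. Text2Mel -> SSRN -> vocoder, then write the wav
    with open(out_path, "wb"):  # placeholder: creates an empty file
        pass

# The 40 test sentences, one per line.
with open("eval_sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

for step in range(30_000, 90_001, 1_000):  # 61 checkpoints, 30k..90k
    model = load_model(step)
    out_dir = f"samples/step_{step}"
    os.makedirs(out_dir, exist_ok=True)
    for i, text in enumerate(sentences):
        synthesize(model, text, f"{out_dir}/{i:02d}.wav")
```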

So back to the increased hidden dimensions: with +128 it got better, and with +256 it seems to get even better. Basically, all the words are pronounced, with only a few minor mispronunciations left. You know, it’s English; words often aren’t pronounced the way they’re spelled. That’s why some implementations convert all the text into phonetic transcriptions first, so the sounds the model has to learn are actually regular. I guess I should do the same, but first I must finish the ‘Deepstory’ web framework and combine all the modules into that app. Then I can finally start working on it with this model.
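
For the phoneme route, a package like `g2p_en` handles the grapheme-to-phoneme conversion on its own (I haven’t plugged it into the model yet; this just shows what the transcription looks like):

```python
from g2p_en import G2p  # pip install g2p_en

g2p = G2p()
# Classic heteronym example: the two "refuse"s get different phonemes.
print(g2p("I refuse to collect the refuse."))
# -> roughly ['AY1', ' ', 'R', 'IH0', 'F', 'Y', 'UW1', 'Z', ' ', ...]
```

Training on ARPAbet symbols like these instead of raw characters is exactly the ‘regular sounds’ trick those other implementations use.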