Well, training the audio model was no easy task, and there were too many combinations and too many parameters to tune. I was too tired to try yet another combination, so instead of perfecting it, I’m going to spend time finishing the whole framework first. The audio results are acceptable, with a slight robotic touch, and I’m going to use some of the generated samples to proceed to the next step.
The first thing I’m going to try is Speech-Driven Animation. It was mainly trained on the GRID dataset, and two other pretrained models trained on different datasets are also provided; however, the results from those two are not desirable. Since I’m not going to train my own model, I’ll use the GRID model they have trained, which gives the best result in my opinion. However, this model outputs a very low-resolution video, and it seems to only work well with this specific image:
Well, it doesn’t really matter who is speaking, but how they speak. Let’s just try with a video generated from this image.
The good thing is that it’s not resource-demanding and it’s also quite fast at generating a video. For the audio I’ve tested, it seems to give decent animation, including some eye expression. Nonetheless, the lip-sync is not as accurate as it should be, though the mouth does move.
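Just for my own reference, generation looks roughly like this. The sda module name, the VideoAnimator class, and its arguments are how I remember the speech-driven-animation README, so treat them as assumptions rather than the exact API:

```python
# Rough sketch of generating a clip with the pretrained GRID model.
# Class and argument names follow the speech-driven-animation README as I
# remember it; double-check against the repo before running.
import sda

va = sda.VideoAnimator(gpu=0, model_path="grid")  # load the GRID-trained weights
vid, aud = va("face.bmp", "speech.wav")           # still image + the TTS audio sample
va.save_video(vid, aud, "talking_head.mp4")       # low-resolution talking-head video
```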
Another choice is VOCA (Voice Operated Character Animation), which provides more thoroughly trained models and outputs a video rendered from a generated 3D mesh animation sequence, so it can produce very high-resolution results. I’ll try that later.
So here are the results from speech-driven animation. Basically, I feed the audio to the model, and it outputs a video of the person saying it. Weirdly enough, the provided sample was a cropped image rather than a square video, so at first I was quite worried it might give a bad result when fed into the first-order motion model. But amazingly, it adapts (there’s face keypoint detection), and the results are pretty good for a fan work. It’s not production level, but it moves according to the audio and gives an interactive feel; the eyes even move a little. Most importantly, it’s simple and fast.
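For the record, the second half of the pipeline looks roughly like this, assuming the generated clip is used as the driving video and the new face as the source image. The load_checkpoints/make_animation helpers and the config/checkpoint paths follow the first-order-model repo’s demo.py as I remember it, so treat the exact names as assumptions:

```python
# Sketch: use the speech-driven-animation output as the driving video for the
# first-order motion model. Function names and paths follow the repo's demo.py
# as I remember it; the config/checkpoint/image paths are placeholders.
import imageio
import numpy as np
from skimage.transform import resize
from demo import load_checkpoints, make_animation  # from the first-order-model repo

source_image = resize(imageio.imread("new_face.png"), (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3]
                 for frame in imageio.mimread("talking_head.mp4", memtest=False)]

generator, kp_detector = load_checkpoints(
    config_path="config/vox-256.yaml",
    checkpoint_path="vox-cpk.pth.tar",
)
predictions = make_animation(source_image, driving_video, generator, kp_detector,
                             relative=True)
imageio.mimsave("result.mp4", [(255 * f).astype(np.uint8) for f in predictions])
```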
And of course Henry Cavill from the Netflix Witcher series:
And while the training kept running, I read more about the technology, finally getting to know Mel spectrograms and the Griffin-Lim algorithm that converts them back to audio. It seems I could swap in an alternative vocoder to fix the noise and the robotic sound; I hope someone has done research on that (I think I even found something, with samples, but I closed the tab anyway). So DCTTS first converts the audio to Mel spectrograms, and it consists of two models. The first one, Text2Mel, is the most important part: it uses the Mel spectrograms of the audio together with the texts I provided to train a convolutional network with guided attention, which tells the network when each sound should appear based on its position in the text. The second one, SSRN, is there to polish the sound: it is trained on all the Mel spectrograms so it learns to reconstruct a full, detailed spectrogram from a coarse or imperfect Mel spectrogram, making the result sound natural.

The thing is, I didn’t really understand it, the SSRN part mostly. It turns out I should feed it every example I have to increase its accuracy, while I was only training on clips with more than 5 words, as in the Text2Mel model. So yes, retrain again; no wonder it was giving me better graphs and performance when I only kept clips above 3 words. I should not have retrained it every time I changed the number of audios used for Text2Mel; instead I should keep training it on all the audios while varying the other parameters.
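To make that concrete, here is a minimal Mel-spectrogram round trip with librosa: audio in, Mel spectrogram out, and Griffin-Lim to get (robotic-sounding) audio back. The parameter values (80 Mel bands, 1024-point FFT) are just typical choices, not necessarily what my DCTTS setup uses:

```python
# Minimal illustration of the Mel-spectrogram round trip DCTTS relies on:
# audio -> Mel spectrogram -> (approximate) audio again via Griffin-Lim.
import librosa
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=22050)

# A Mel spectrogram like the ones Text2Mel predicts (80 Mel bands is a common choice).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert it with Griffin-Lim; the iterative phase estimation is where the
# robotic/metallic artifacts come from.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)

sf.write("reconstructed.wav", y_hat, sr)
```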
And… so, looking at better alternatives to Griffin-Lim, I found some other ways to turn the Mel spectrograms back into waveforms (neural vocoders): there’s WaveRNN, MelGAN…
I’m gonna try those.
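For example, a pretrained MelGAN should be usable as a drop-in replacement for Griffin-Lim at inference time. The torch.hub entry point and the inference() call below are how I remember the seungwonpark/melgan README, so treat the exact names as assumptions:

```python
# Sketch: replace Griffin-Lim with a pretrained MelGAN vocoder.
# Hub repo name and inference() method follow the seungwonpark/melgan README
# as I remember it; verify before relying on it.
import torch
import soundfile as sf

vocoder = torch.hub.load("seungwonpark/melgan", "melgan")
vocoder.eval()

# Placeholder Mel spectrogram, shape (batch, n_mels, frames); in practice this
# would be the spectrogram predicted by Text2Mel/SSRN, normalized the way the
# vocoder expects.
mel = torch.randn(1, 80, 234)

with torch.no_grad():
    audio = vocoder.inference(mel)

sf.write("melgan_out.wav", audio.squeeze().cpu().numpy(), 22050)
```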