Frustrations and a surprise

Clearly, deep learning isn’t some sort of magic trick where you can feed in arbitrary data and get desirable results. After running into difficulties on Windows, mainly with CUDA and cuDNN, I was forced to return to Ubuntu to continue my project. I also stumbled upon Tacotron 2. Initially I thought it produced much more desirable output, but then I realized it was difficult to train.

So I had to return to DCTTS. After carefully watching the video by sentdex, I realized he never explained how to achieve his results, that is, how to create a model from limited data. His method was apparently to start from the pretrained female model and then somehow use transfer learning to get there. But how!? What is this magical trick? His results were pretty good; I wished he had provided his code or some sort of tutorial. After all, how am I supposed to implement this from scratch?

Time for me to continue the deep learning courses on Coursera and learn the foundations of deep learning so I can build this myself. It seems that even with sufficient data, it will still be some days before I achieve any ideal results. Hopefully I will make it before May.

So after some random testing and swapping, you know, nothing really theoretical or technical in deep-learning terms, I replaced the Text2Mel model with one I had trained on the voice of The Witcher’s Geralt (Doug Cockle). Surprisingly, combined with the pretrained LJSpeech SSRN, it was able to produce a voice that sounds just like Geralt, though of course a bit robotic. Perhaps I shall try other voices as well.
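For anyone curious what that swap might look like in code, here is a minimal sketch. I’m assuming a dc_tts-style TensorFlow 1.x setup where Text2Mel and SSRN are built under separate variable scopes, which lets each network be restored from a different checkpoint; the two dense layers below are stand-ins for the real networks, and the logdir paths are placeholders for my own runs.

```python
import tensorflow as tf  # TensorFlow 1.x, as used by the dc_tts codebase

def text2mel(inputs):
    # Stand-in for the real Text2Mel network: text features -> mel spectrogram.
    with tf.variable_scope("Text2Mel"):
        return tf.layers.dense(inputs, 80)

def ssrn(mels):
    # Stand-in for the real SSRN: mel spectrogram -> full linear spectrogram.
    with tf.variable_scope("SSRN"):
        return tf.layers.dense(mels, 513)

inputs = tf.placeholder(tf.float32, [None, None, 256])
mags = ssrn(text2mel(inputs))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Text2Mel weights come from my Geralt fine-tuning run...
    t2m_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Text2Mel")
    tf.train.Saver(t2m_vars).restore(
        sess, tf.train.latest_checkpoint("logdir/geralt-text2mel"))

    # ...while SSRN keeps the pretrained LJSpeech weights.
    ssrn_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="SSRN")
    tf.train.Saver(ssrn_vars).restore(
        sess, tf.train.latest_checkpoint("logdir/LJ01-ssrn"))

    # From here, synthesis runs exactly as in the original pipeline.
```

Since the two networks share no variables, mixing checkpoints like this is safe as long as the mel-spectrogram dimensions line up.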

Well, it feels good to have some results. They’re not really polished, but they’re presentable nonetheless.

His voice:

Here are some samples:

Geralt bowed cursorily, looking at the knights.

40K steps:

70K steps:

Both wore armour and crimson cloaks with the emblem of the White Rose on their left shoulder.

40K steps:

70K steps:

He was somewhat surprised as, so far as he knew, there was no Commandery of that Order in the neighbourhood.

40K steps:

70K steps:

I found that the model trained for 40K steps produces clips more accurately and clearly than the others.

The further I trained beyond 40K steps, the higher the loss became, and the less accurate the output. Maybe some overfitting?
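If it is overfitting, the standard fix would be to hold out a few clips for validation and keep whichever checkpoint scores the lowest validation loss, rather than the most-trained one. A toy sketch (the loss values are made up):

```python
# Toy sketch with made-up loss values: keep the checkpoint whose
# *validation* loss is lowest, not the one trained the longest.
validation_history = [(10_000, 0.41), (40_000, 0.28), (70_000, 0.35)]

best_step, best_loss = min(validation_history, key=lambda pair: pair[1])
print(f"keep checkpoint at step {best_step} (val loss {best_loss:.2f})")
# -> keep checkpoint at step 40000 (val loss 0.28)
```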