Audio pre-processing

When I was researching transfer learning/fine-tuning (whatever it's called), I stumbled upon the paper Exploring Transfer Learning for Low Resource Emotional TTS. Yeah, more DCTTS. I was quite surprised that I only found this now, because I realized I have to redo pretty much everything (minus the parameter-experimenting part). Even though these past few days I have successfully created a production-ready model for a character's voice, I still had some questions. One of them is: what does punctuation actually do? Does a comma change the intonation, or does it actually create some silence? If it creates silence, then it's not really possible to recreate the silence for each comma that was converted from other punctuation like "…". The most important question is: why is it that, when I was training on the LJ dataset, commas were handled just fine, but when I was training on my own dataset, some words were not pronounced? Is it by chance?

The answer lies in the pre-processing; clearly, there's work to be done. Not just text processing, but audio processing too, which means trimming the silence and splitting the sentences. It turns out the audio clips usually contain a few sentences in a single clip, with pauses between those sentences, and that is not good for the attention module in this network. So the best thing is to split the sentences.

There are a few things to do:

  1. Decrease the trimming threshold top_db from 60 to 20, so that non-verbal sounds at the beginning and the end are trimmed away.
  2. Split the audio clips. The first thing to do is to process sentences with more than one "ending punctuation", meaning full stop, exclamation mark and question mark.
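
The two steps above can be sketched roughly as follows; `trim_clip` and `split_sentences` are hypothetical helper names, and the exact split rules will depend on how the transcripts are formatted:

```python
import re

def trim_clip(path, top_db=20):
    # librosa is imported lazily here so the text helper below
    # also works without audio dependencies installed.
    import librosa
    y, sr = librosa.load(path, sr=None)
    # Lowering top_db from librosa's default of 60 down to 20 treats
    # quieter non-verbal noise at the edges as silence too.
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    return y_trimmed, sr

def split_sentences(text):
    # Split a transcript on "ending punctuation" (. ! ?),
    # keeping each mark attached to its own sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

Splitting the audio itself at the matching pauses would still need to be handled separately, e.g. by looking at the silent gaps with something like librosa.effects.split.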

The other thing I will probably do is to train specific layers, as instructed in this repo, by freezing the other layers in the PyTorch implementation. This also means I'm going to add some functionality to the PyTorch implementation. This will be achieved by setting requires_grad = False on all parameters, then setting it back to True for the layers I want to train.
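
A minimal sketch of that freezing step (`freeze_except` is a hypothetical helper; the real layer names depend on the DCTTS implementation being used):

```python
import torch.nn as nn

def freeze_except(model, trainable_prefixes):
    # Freeze every parameter first...
    for param in model.parameters():
        param.requires_grad = False
    # ...then re-enable gradients only for parameters whose
    # name starts with one of the given prefixes.
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in trainable_prefixes):
            param.requires_grad = True

# Toy example: freeze everything except the second layer ("1.").
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
freeze_except(model, ["1."])
```

When building the optimizer afterwards, only the still-trainable parameters should be passed in, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.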

Apart from that, I've started building the web app, and I hope to finish the text-to-audio part today. Time is ticking.