Deepstory – Synthesize in Voice class and combine audio arrays

To keep the code organized, I've written a class named Voice that wraps everything related to dctts and supports Python's context manager protocol, so that model loading, closing, and memory clearing are handled easily.

class Voice:
    norm_factor = 3.0

    def __init__(self, speaker):
        self.speaker = speaker
        self.text2mel = None
        self.ssrn = None

    def __enter__(self):
        self.load()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    def load(self):
        self.text2mel = Text2Mel(vocab).to(device).eval()
        self.text2mel.load_state_dict(torch.load(f'data/dctts/{self.speaker}/t2m.pth')['state_dict'])
        self.ssrn = SSRN().to(device).eval()
        self.ssrn.load_state_dict(torch.load(f'data/dctts/{self.speaker}/ssrn.pth')['state_dict'])

    def close(self):
        del self.text2mel
        del self.ssrn
        torch.cuda.empty_cache()

    # referenced from original repo
    def synthesize(self, text):
        with torch.no_grad():  # no grad to save memory
            normalized_text = text_normalize(text) + "E"  # text normalization, E: EOS
            L = torch.from_numpy(np.array([[char2idx[char] for char in normalized_text]], np.int64)).to(device)
            zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
            Y = zeros

            while True:
                _, Y_t, A = self.text2mel(L, Y, monotonic_attention=True)
                Y = torch.cat((zeros, Y_t), -1)
                _, attention = torch.max(A[0, :, -1], 0)
                attention = attention.item()
                if L[0, attention] == vocab.index('E'):  # EOS
                    break

            _, Z = self.ssrn(Y)  # batch ssrn instead?
            Z = Z.cpu().detach().numpy()

        wav = spectrogram2wav(Z[0, :, :].T)
        # normalize the audio with pydub
        audioseg = AudioSegment(wav.tobytes(), sample_width=2, frame_rate=hp.sr, channels=1)
        # normalized = effects.normalize(audioseg, self.norm_factor)
        normalized = audioseg.apply_gain(-30 - audioseg.dBFS)
        wav = np.array(normalized.get_array_of_samples())
        return wav

When loading the models, it is important to call the .eval() method so that batch norm and dropout layers run in evaluation mode instead of training mode. When closing, the text2mel and ssrn objects are deleted, and then I call torch.cuda.empty_cache() to release the cached GPU memory.
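
As a quick illustration outside the project code, this is the behaviour those two calls are there for (a minimal sketch, not part of Deepstory):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()
print(drop(x))  # roughly half the values zeroed, the rest scaled by 2.0

drop.eval()
print(drop(x))  # identity: tensor([[1., 1., 1., 1.]])

# del only drops the Python references; the blocks still held by
# PyTorch's caching allocator are released back to the GPU with:
torch.cuda.empty_cache()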

The synthesize method is mostly adapted from the original code. What I added is the conversion of the output spectrogram into a NumPy waveform array, plus volume normalization with pydub: I found that when the waveform from spectrogram2wav is too loud, the audio contains clicking and high-pitched artifacts, and normalizing the volume afterwards prevents this.
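
On its own, that normalization step looks roughly like this (a sketch assuming the waveform is already int16 at hp.sr; normalize_to_dbfs is a hypothetical helper name, not part of the project):

import numpy as np
from pydub import AudioSegment

def normalize_to_dbfs(wav, sample_rate, target_dbfs=-30.0):
    # Hypothetical helper: wrap an int16 waveform in an AudioSegment,
    # shift its gain so the average loudness hits target_dbfs, and
    # return it as an int16 NumPy array again.
    seg = AudioSegment(wav.tobytes(), sample_width=2, frame_rate=sample_rate, channels=1)
    normalized = seg.apply_gain(target_dbfs - seg.dBFS)
    return np.array(normalized.get_array_of_samples(), dtype=np.int16)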

To use it in the deepstory class:

for speaker, sentence_ids in self.speaker_dict.items():
    with Voice(speaker) as voice:
        for i in sentence_ids:
            self.sentence_dicts[i]['wav'] = voice.synthesize(self.sentence_dicts[i]['text'])

As mentioned, I want to synthesize all of the audio belonging to a single speaker in one pass, which avoids the extra model-loading time of looping over the sentences one by one. So I loop through speaker_dict, synthesize each sentence at the indices it lists, and write the output back into sentence_dicts under a new key 'wav'.
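
For reference, this is roughly the shape of the two structures involved here; the field names come from the snippets in this post, but the contents below are simplified placeholders:

# speaker_dict maps each speaker to the indices of their sentences
speaker_dict = {
    'geralt': [0, 1],
    'yennefer': [2, 3],
}

# sentence_dicts holds one dict per sentence; 'begin' marks the first
# sentence of a line, and 'wav' is filled in by the loop above
sentence_dicts = [
    {'speaker': 'geralt', 'text': 'Hello.', 'begin': True},
    {'speaker': 'geralt', 'text': 'How are you?', 'begin': True},
    {'speaker': 'yennefer', 'text': "I'm fine.", 'begin': True},
    {'speaker': 'yennefer', 'text': 'And you?', 'begin': True},
]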

Combining audio

To prepare for the next step, generating the 'base' video with speech-driven animation, the 'wav' array inside each sentence_dict needs to be processed so that a video is not generated for every single sentence, but for the combined audio of consecutive sentences from the same speaker.

For example:

Geralt: Hello

Geralt: How are you

Yennefer: I’m fine.

Yennefer: And you?

Instead of generating four videos for the four items in sentence_dicts, I combine the audio arrays of consecutive sentences from the same speaker, so that only two videos are generated.

def combine_wavs(self):
    """Concat wavs of same speaker, so that video of speaker can be made easily"""
    wavs_dicts = []
    wavs_dict = {}
    last_speaker = ''
    for i, sentence_dict in enumerate(self.sentence_dicts):
        wav = sentence_dict['wav']
        # Add silence between lines
        if sentence_dict['begin']:
            wav = np.pad(wav, (get_silence(0.5), 0), 'constant')  # Every line has 0.5s silence

        if i != 0 and last_speaker != sentence_dict['speaker']:
            wavs_dict['speaker'] = last_speaker
            # Add silence between each sentence within a line, default 0.15s
            wavs_dict['wav'] = np.concatenate(
                [*intersperse(np.zeros(get_silence(0.15), dtype=np.int16), wavs_dict['wav'])], axis=None)
            # pad silence at the end
            wavs_dict['wav'] = np.pad(wavs_dict['wav'], (0, get_silence(0.5)), 'constant')
            wavs_dicts.append(wavs_dict)
            wavs_dict = {}

        if 'wav' not in wavs_dict:
            wavs_dict['wav'] = [wav]
        else:
            wavs_dict['wav'].append(wav)
        last_speaker = sentence_dict['speaker']

    if wavs_dict:
        wavs_dict['speaker'] = last_speaker
        # Add silence between each sentence within a line, default 0.15s
        wavs_dict['wav'] = np.concatenate(
            [*intersperse(np.zeros(get_silence(0.15), dtype=np.int16), wavs_dict['wav'])], axis=None)
        # pad silence at the end
        wavs_dict['wav'] = np.pad(wavs_dict['wav'], (0, get_silence(0.5)), 'constant')
        wavs_dicts.append(wavs_dict)
    # TODO: add silence according to punctuation
    self.wav = np.concatenate([wavs_dict['wav'] for wavs_dict in wavs_dicts], axis=None)
    self.wavs_dicts = wavs_dicts
    # scipy.io.wavfile.write('export/combined.wav', hp.sr, self.wav)
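
For the four-line Geralt/Yennefer example above, this produces two entries in wavs_dicts, roughly like this (illustrative shape only; the actual 'wav' values are int16 waveforms):

wavs_dicts = [
    {'speaker': 'geralt', 'wav': ...},    # "Hello" + "How are you", with silences added
    {'speaker': 'yennefer', 'wav': ...},  # "I'm fine." + "And you?", with silences added
]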

This is adapted from the combine code of my audio pre-processing project; I use the same padding and intersperse technique to insert the silence.
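
The two helpers used above are small enough to sketch here; this is roughly what I assume they do (my reconstruction, the bodies in the project may differ):

def get_silence(seconds):
    # number of samples corresponding to `seconds` of silence at the DCTTS sample rate
    return int(seconds * hp.sr)

def intersperse(delimiter, seq):
    # yield the items of seq with `delimiter` inserted between each pair,
    # e.g. intersperse(0, [a, b, c]) -> a, 0, b, 0, c
    for i, item in enumerate(seq):
        if i:
            yield delimiter
        yield item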