Deepstory – Text normalization

In my code, text normalization happens at two points. The first happens while loading the sentences: punctuation such as ‘;’, ‘:’ and ‘…’ is replaced with a comma, since the model can only recognize characters in a vocab of lowercase letters plus the full stop, comma, question mark and exclamation mark. The second happens inside the Voice class, which handles model loading and audio synthesis: it converts all characters to lowercase and removes any character that is not in the vocab list (configured in modules/dctts/hparams.py), so that no errors occur when converting the characters into ids that can be represented as a tensor.
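Conceptually, that second step looks something like the sketch below; the vocab string here is an assumption on my part, the real one is defined in modules/dctts/hparams.py.

vocab = "PE abcdefghijklmnopqrstuvwxyz'.,?!"  # assumed dc_tts-style vocab: P = padding, E = end of string
char2idx = {char: idx for idx, char in enumerate(vocab)}

def text_to_ids(text):
    """Lowercase, drop out-of-vocab characters, then map each character to its id."""
    text = text.lower()
    return [char2idx[char] for char in text if char in char2idx]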

Originally, I wrote this part so that sentences could be chopped at commas or at clause boundaries (the latter was never implemented), since the dctts model wasn’t able to synthesize the words after a comma, or anything at all if the sentence was too long. After pre-processing the audio to normalize the silences, however, that problem went away, and what remains of this code replaces other punctuation with commas to indicate pauses, and can still split at commas if you really want to, though that produces somewhat inconsistent audio. Another important feature is building a punctuation dictionary so that silences of different lengths can be inserted between sentences, giving a more dynamic result.
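The dictionary itself can be pictured as a mapping from punctuation to pause length. A minimal sketch with made-up durations (the actual values and sample rate live elsewhere in the project):

import numpy as np

# Illustrative durations in seconds, not the actual values used in Deepstory.
PUNCT_SILENCE = {',': 0.15, '.': 0.35, '?': 0.45, '!': 0.45}

def silence_for(punct, sr=22050):
    """Zeros to splice between synthesized clips, sized by the punctuation."""
    return np.zeros(int(sr * PUNCT_SILENCE.get(punct, 0.35)), dtype=np.float32)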

def parse_text(self, text, default_speaker, separate_comma=False,
               n_gram=2, separate_sentence=False, parse_speaker=True, normalize=True):
    """
    Parse the input text into suitable data structure
    :param n_gram: concat sentences of this max length in a line
    :param text: source
    :param default_speaker: the default speaker used if no speaker is specified
    :param separate_comma: split by comma
    :param separate_sentence: split a sentence if multiple clauses exist
    :param parse_speaker: bool to turn speaker parsing on/off
    :param normalize: convert common punctuation besides the comma into commas
    """

    lines = re.split(r'\r\n|\n\r|\r|\n', text)

    line_speaker_dict = {}
    # TODO: allow speakers not in model_list and later are forced to be replaced
    if parse_speaker:
        # re.match(r'^.*(?=:)', text)
        for i, line in enumerate(lines):
            if re.search(r':|\|', line):
                # ?: non capture group of : and |
                speaker, line = re.split(r'\s*(?::|\|)\s*', line, maxsplit=1)
                # add entry only if the voice model exist in the folder,
                # the unrecognized one will be changed to default in later code
                if speaker in self.model_list:
                    line_speaker_dict[i] = speaker
                lines[i] = line

    if normalize:
        lines = [normalize_text(line) for line in lines]

    # separate by spacy sentencizer
    lines = [separate(line, n_gram, comma=separate_comma) for line in lines]

    sentence_dicts = []
    for i, line in enumerate(lines):
        for j, sent in enumerate(line):
            if sentence_dicts:
                if sent[0].is_punct and sent[0].text not in ('“', '‘'):
                    sentence_dicts[-1]['punct'] = sentence_dicts[-1]['punct'] + sent.text
                    continue
            sentence_dict = {
                'text': sent.text,
                'begin': j == 0,  # True for the first sentence in a line
                'punct': '',
                'speaker': line_speaker_dict.get(i, self.model_list[default_speaker])
            }

            while not sentence_dict['text'][-1].isalpha():
                sentence_dict['punct'] = sentence_dict['punct'] + sentence_dict['text'][-1]
                sentence_dict['text'] = sentence_dict['text'][:-1]
            # Reverse the punctuation order since I add it based on the last item
            sentence_dict['punct'] = sentence_dict['punct'][::-1]
            sentence_dict['text'] = sentence_dict['text'] + sentence_dict['punct']
            sentence_dicts.append(sentence_dict)

    speaker_dict = {}
    for i, sentence_dict in enumerate(sentence_dicts):
        if sentence_dict['speaker'] not in speaker_dict:
            speaker_dict[sentence_dict['speaker']] = []
        speaker_dict[sentence_dict['speaker']].append(i)
    self.speaker_dict = speaker_dict
    self.sentence_dicts = sentence_dicts
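For reference, here is the kind of input the method expects; the speaker names are illustrative, and deepstory stands for whatever instance holds the method:

script = ('geralt: Evil is evil. Lesser, greater, middling.\n'
          'yennefer|You are dodging the question.\n'
          'This line has no prefix, so it falls back to the default speaker.')
deepstory.parse_text(script, default_speaker=0, n_gram=1)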

So to go through the whole process:

  • default_speaker is the speaker used if no speaker is recognized, or if the speaker name does not exist in self.model_list (built from the folder names inside data/dctts, which are the dctts models).
  • The text is split into so-called lines by newline characters. Note that lines are different from sentences: a character can speak multiple sentences in a single line, and the line structure is what makes it possible to record that the nth line is spoken by a certain character, so that the sentences inside that line inherit the speaker.
  • If parse_speaker is enabled, each line is checked for a ‘:’ or ‘|’ separator between the speaker and the content. The line number and its speaker are stored in the line_speaker_dict structure, which is used later, when the content itself is parsed, to map a line number back to its speaker. After the split, the line is replaced by the content alone, so that neither the speaker name nor the separator remains, as in the snippet below.
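For instance, a line like the following is split once at the first separator (speaker name illustrative):

import re

line = 'geralt: Evil is evil. Lesser, greater, middling.'
speaker, content = re.split(r'\s*(?::|\|)\s*', line, maxsplit=1)
# speaker == 'geralt', content == 'Evil is evil. Lesser, greater, middling.'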
  • Apply the normalize_text function to each line in lines.
import re
from unidecode import unidecode

def normalize_text(text):
    """Normalize text so that some punctuations that indicate pauses will be replaced as commas"""
    replace_list = [
        [r'(\w)’(\w)', r"\1'\2"],  # fix apostrophe for content from books
        [r'\(|\)|:|;| “|(\s*-+\s+)|(\s+-+\s*)|\s*-{2,}\s*', ', '],
        [r'\s*,[^\w]*,\s*', ', '],  # capture multiple commas
        [r'\s*,\s*', ', '],  # format commas
        [r'\.,', '.'],
        [r'[‘’“”]', '']  # strip quote
    ]
    for regex, replacement in replace_list:
        text = re.sub(regex, replacement, text)
    text = re.sub(r' +', ' ', text)
    text = unidecode(text)  # Get rid of the accented characters
    return text
  • So it fixes the apostrophe, which lets content from the Witcher books be dropped straight in as audiobook material. It also replaces some frequent punctuation that can mark a pause in a sentence with commas. Finally, it strips the curly quotes ‘’“”, while the straight ' and " are kept.
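For example, on a made-up test string:

print(normalize_text('He said: “Run -- don’t look back!”'))
# -> He said, Run, don't look back!

The colon, the opening quote and the double hyphen all collapse into commas, and the curly apostrophe becomes a straight one.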
  • Since I was able to generate longer sentences with my new audio-pre-processed model, I think it’s a good idea to concatenate sentences so that the result sounds more consistent and shorter sentences can be grouped together (smarter grouping will be implemented later).
def separate(text, n_gram, comma, max_len=30):  # note: max_len is currently unused
    _nlp = nlp if comma else nlp_no_comma
    lines = []
    line = ''
    counter = 0
    for sent in _nlp(text).sents:
        if sent.text:
            if counter == 0:
                line = sent.text
            else:
                line = f'{line} {sent.text}'
            counter += 1

            if counter == n_gram:
                lines.append(_nlp(line))
                line = ''
                counter = 0

    # for remaining sentences
    if line:
        lines.append(_nlp(line))

    return lines
  • There’s a counter and a cache called line (different from the line in the parse_text function). The text is first parsed by a spaCy NLP object with a sentencizer pipeline, which separates the current text (here, one line from the outer lines structure, possibly containing multiple sentences) into smaller sentences. These smaller sentences are concatenated onto line and the counter is incremented; when it reaches n_gram (perhaps misnamed), the accumulated line is re-parsed and appended to the list lines. So with an n_gram of 2, two sentences end up in each line. If comma is enabled, a different NLP object is used that also splits at commas; a sketch of the two pipelines follows.
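The two NLP objects differ only in which characters their sentencizer splits on. A minimal sketch of how they could be built, assuming spaCy 3 (the actual construction lives elsewhere in the project):

import spacy

# nlp also breaks sentences at commas; nlp_no_comma only at terminal punctuation.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer', config={'punct_chars': ['.', '!', '?', ',']})

nlp_no_comma = spacy.blank('en')
nlp_no_comma.add_pipe('sentencizer', config={'punct_chars': ['.', '!', '?']})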
  • The next step is to create the sentence_dicts structure, an ordered list containing one dictionary per sentence: the key ‘text’ holds the content, ‘speaker’ the voice model, ‘begin’ whether it is the first sentence in a line (which gets a longer silence), and ‘punct’ the punctuation stripped from the end of the sentence, referenced when inserting silences while combining all the synthesized audio.
  • Finally, another structure, speaker_dict, is created: where line_speaker_dict maps line numbers to speakers, speaker_dict uses the speaker as the key and a list of the indices of that speaker’s sentences as the value. It exists so that all the sentences belonging to a single speaker can easily be batch-processed, as in the example below.
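Put together, running the example script from earlier with n_gram=1 would yield structures roughly like these, assuming geralt and yennefer folders exist in data/dctts (the unprefixed line maps to self.model_list[0]):

sentence_dicts = [
    {'text': 'Evil is evil.', 'begin': True, 'punct': '.', 'speaker': 'geralt'},
    {'text': 'Lesser, greater, middling.', 'begin': False, 'punct': '.', 'speaker': 'geralt'},
    {'text': 'You are dodging the question.', 'begin': True, 'punct': '.', 'speaker': 'yennefer'},
    # ...plus one more entry for the unprefixed line, spoken by the default speaker
]
speaker_dict = {'geralt': [0, 1], 'yennefer': [2]}  # plus the default speaker's indices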