HDF5, h5py, Zarr for packing pre-processed audio data

I feel like I've deviated a bit from my project and started improving the engineering side instead. Still, I decided I needed a dictionary-like structure to store those pre-processed NumPy arrays, and it turns out the h5py library does exactly that: an HDF5 file behaves like a dictionary of named datasets. But after creating a dataset file, I ran into an error saying the object couldn't be read.
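
A minimal sketch of what I mean by dictionary-like (the file name and array here are made up for illustration):

import h5py
import numpy as np

# write: each array is stored under a string key, just like a dict entry
with h5py.File('example.h5', 'w') as f:
    f.create_dataset('LJ001-0001', data=np.random.rand(120, 80).astype('float32'))

# read: look the array up by its key
with h5py.File('example.h5', 'r') as f:
    mel = f['LJ001-0001'][:]   # [:] pulls the dataset back out as a NumPy array
    print(mel.shape)           # (120, 80)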

It seems the problem is that the training program spawns 8 worker processes to speed up data handling. Originally each process read its own separate file; now all of them open the same HDF5 file to pull arrays out of it, and that causes trouble. People online say you either have to limit yourself to a single process (single-threaded) or enable SWMR (single-writer/multiple-reader) mode.
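
For reference, the SWMR route looks roughly like this in h5py: the writer flips the file into SWMR mode after creating its datasets, and readers open it with swmr=True (file and dataset names are placeholders):

import h5py
import numpy as np

# writer side: SWMR needs libver='latest', and datasets must exist before enabling it
writer = h5py.File('mels.h5', 'w', libver='latest')
writer.create_dataset('LJ001-0001', data=np.zeros((120, 80), dtype='float32'))
writer.swmr_mode = True  # from now on, readers may open the file concurrently

# reader side (e.g. inside a worker process): open read-only with swmr=True
reader = h5py.File('mels.h5', 'r', libver='latest', swmr=True)
mel = reader['LJ001-0001'][:]

reader.close()
writer.close()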

I also tried Zarr, which I thought would improve performance, but it turned out to be better suited to one huge array than to a huge number of small ones. In other words, all the clips would have to be packed into one gigantic array, with a separate label structure recording each clip's index range inside it. The reason it was slow for me is that Zarr doesn't offer a single-file structure the way HDF5 does; its single-file option is a Zip store, which has to be unpacked on access even when the data inside isn't compressed. I wasted a bit of time deleting all my HDF5 files and creating Zarr zip files, and in the end I stuck with h5.
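
For completeness, this is roughly the layout Zarr would have preferred, sketched with the zarr v2-style API (the store name, shapes and the index structure are just illustrative assumptions, not code from my project):

import numpy as np
import zarr

# one gigantic array holding every mel frame, concatenated along the time axis
# (the single-file variant of this layout is a Zip store, which is what made reads slow for me)
mels = zarr.open('mels.zarr', mode='w', shape=(0, 80),
                 chunks=(4096, 80), dtype='float32')
index = {}  # clip name -> (start_row, end_row) into the big array

for fname, mel in [('LJ001-0001', np.zeros((120, 80), dtype='float32'))]:
    start = mels.shape[0]
    mels.append(mel, axis=0)                 # grow the big array
    index[fname] = (start, start + mel.shape[0])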

pre-process part in audio.py:

import glob
import os

import numpy as np
from tqdm import tqdm

# get_spectrograms, hp and h5_loader come from the rest of the project
# (spectrogram extraction, hyper-parameters, and the helper shown further down).

# modified for h5py
def preprocess(voice_path):
    """Preprocess the given dataset."""
    with h5_loader(f'{voice_path}/mels.h5', mode='w') as mels, h5_loader(f'{voice_path}/mags.h5', mode='w') as mags:
        wavs_list = glob.glob(f'{voice_path}/wavs/*.wav')
        for wav_path in tqdm(wavs_list):
            mel, mag = get_spectrograms(wav_path)

            t = mel.shape[0]
            # Marginal padding for reduction shape sync.
            num_paddings = hp.reduction_rate - (t % hp.reduction_rate) if t % hp.reduction_rate != 0 else 0
            mel = np.pad(mel, [[0, num_paddings], [0, 0]], mode="constant")
            mag = np.pad(mag, [[0, num_paddings], [0, 0]], mode="constant")
            # Reduction
            mel = mel[::hp.reduction_rate, :]

            fname = os.path.splitext(os.path.basename(wav_path))[0]
            mels.create_dataset(fname, data=mel, dtype='float32', fletcher32=True)
            mags.create_dataset(fname, data=mag, dtype='float32', fletcher32=True)

There was also a problem when I tried to instantiate the file object inside my subclassed PyTorch Dataset: it gave a read error whenever I ran it. That was solved by creating the object inside the __getitem__ method (the square-bracket selector []), but it also meant a new file object had to be created every time an array was loaded, just to read that one array, which is very inefficient.

So I researched some of the underlying concepts of PyTorch's Dataset and DataLoader objects and realized that the dataset is replicated across the DataLoader's num_workers worker processes, eight in my case, to parallelize loading. The answer is then quite simple: all those workers should use the same file object. I moved the file-object instantiation to before the dataset is loaded and pass it into my dataset class as an argument, so no matter how many dataset instances are created, they all refer to the same file object.
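
A stripped-down sketch of that arrangement (the class and file names here are made up; my real version is the Speech class below): open the HDF5 file once up front and hand the same handle to every dataset instance.

import h5py
from torch.utils.data import Dataset, DataLoader

class MelDataset(Dataset):
    def __init__(self, h5_file, fnames):
        self.h5_file = h5_file    # shared, already-open h5py.File handle
        self.fnames = fnames

    def __len__(self):
        return len(self.fnames)

    def __getitem__(self, index):
        return self.h5_file[self.fnames[index]][:]   # no per-item File() creation

h5_file = h5py.File('mels.h5', 'r', libver='latest', swmr=True)
dataset = MelDataset(h5_file, list(h5_file.keys()))
# batch_size=1 keeps the sketch simple; real batching needs a collate_fn that pads variable-length clips
loader = DataLoader(dataset, batch_size=1, num_workers=8)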

To boost performance further, specifying driver='core' in h5py loads the whole file into memory. I made a small function that checks the file size and opens the file with the 'core' driver whenever it's under 4 GB.

custom torch Dataset class in speech.py:

import os

import numpy as np
from torch.utils.data import Dataset

# hp, read_metadata and h5_loader come from the project's hyper-parameter and util modules.


class Speech(Dataset):
    keys = None
    path = None
    mels = None
    mags = None
    _fnames = None
    _text_lengths = None
    _texts = None
    vocab = hp.vocab

    def __init__(self, start, end):
        self.fnames = Speech._fnames[start:end]
        self.text_lengths = Speech._text_lengths[start:end]
        self.texts = Speech._texts[start:end]

    @classmethod
    def singleton(cls, name):
        # Lazily open '<name>.h5' the first time it is needed and cache the
        # handle on the class, so every Speech instance shares one file object.
        if getattr(cls, name) is None:
            setattr(cls, name, h5_loader(os.path.join(cls.path, f'{name}.h5')))

    @classmethod
    def load(cls, keys, dir_name, file):
        cls.keys = keys
        cls.path = os.path.join(os.path.dirname(os.path.realpath(__file__)), dir_name)
        cls._fnames, cls._text_lengths, cls._texts = read_metadata(os.path.join(cls.path, file))

    @classmethod
    def get_script_length(cls):
        return len(cls._fnames)

    def __len__(self):
        return len(self.fnames)

    def __getitem__(self, index):
        data = {}
        if 'texts' in self.keys:
            data['texts'] = self.texts[index]
        if 'mels' in self.keys:
            Speech.singleton('mels')
            data['mels'] = Speech.mels[self.fnames[index]][:]
        if 'mags' in self.keys:
            Speech.singleton('mags')
            data['mags'] = Speech.mags[self.fnames[index]][:]
        if 'mel_gates' in self.keys:
            data['mel_gates'] = np.ones(data['mels'].shape[0], dtype=np.int64)  # TODO: because pre processing!
        if 'mag_gates' in self.keys:
            data['mag_gates'] = np.ones(data['mags'].shape[0], dtype=np.int64)  # TODO: because pre processing!
        return data
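
To tie it together, a hedged usage sketch (the directory name, metadata file name and train/validation split below are assumptions for illustration, not the project's actual values):

from torch.utils.data import DataLoader

# assumed layout: ./data/ contains metadata.csv plus the mels.h5 / mags.h5 created by preprocess()
Speech.load(keys=['texts', 'mels', 'mel_gates'], dir_name='data', file='metadata.csv')

n = Speech.get_script_length()
train_set = Speech(0, int(n * 0.95))    # first 95% of the clips
valid_set = Speech(int(n * 0.95), n)    # remaining 5%

# batch_size=1 keeps the sketch simple; real batching needs a collate_fn that pads variable-length clips
train_loader = DataLoader(train_set, batch_size=1, num_workers=8)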

h5_loader function in util.py (some options are experimental and commented out):

import os

import h5py

# hp (hyper-parameters) is imported from the project and provides max_load_memory (~4 GB).
def h5_loader(filepath, mode='r'):
    additional = {}
    if mode == 'r':
        if os.stat(filepath).st_size < hp.max_load_memory:  # files under 4 GB are loaded fully into memory
            additional['driver'] = 'core'
        else:
            additional['swmr'] = True
            # additional['rdcc_w0'] = 0.4
            # additional['rdcc_nslots'] = 4019
            # additional['rdcc_nbytes'] = 40960 ** 2
    return h5py.File(filepath, mode=mode, libver='latest', **additional)

Feels good to finally solve this! It makes training smoother and easier when I need to run it on Google Colab. And I also need to finish the web app ASAP.