feat!: when preprocessing, EveryVoice forces equal-length time and feature representations
the audio length in samples must be divisible by the declared hop size
the number of spectrogram frames multiplied by the hop size must exactly equal the number of audio samples
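
A minimal sketch of the two constraints, for illustration only. The function name `check_alignment` and its parameters are hypothetical and are not taken from the EveryVoice code base; the actual preprocessing may enforce the constraints differently.

```python
def check_alignment(n_samples: int, n_frames: int, hop_size: int) -> None:
    """Raise ValueError unless the audio and spectrogram lengths line up.

    Hypothetical helper: the name and signature are assumptions made
    for this example, not part of EveryVoice.
    """
    if hop_size <= 0:
        raise ValueError("hop_size must be a positive integer")
    # Constraint 1: the audio length must be a whole number of hops.
    if n_samples % hop_size != 0:
        raise ValueError(
            f"{n_samples} samples is not divisible by hop size {hop_size}"
        )
    # Constraint 2: frames * hop_size must equal the sample count exactly.
    if n_frames * hop_size != n_samples:
        raise ValueError(
            f"{n_frames} frames * hop size {hop_size} != {n_samples} samples"
        )


# 86 frames at a hop size of 256 correspond to exactly 22016 samples.
check_alignment(n_samples=22016, n_frames=86, hop_size=256)
```

With the same hop size, 22050 samples would fail the first check, because 22050 is not a multiple of 256.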