This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation without sacrificing sample quality. We replace each encoder block with an STFT whose parameters match the temporal resolution of the corresponding decoder block, and the STFT output feeds that block's skip connection. FastFit reduces the model's parameter count and generation time by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrate that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further show that FastFit produces sound quality comparable to that of the baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.
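To make the encoder replacement concrete, the following is a minimal NumPy sketch of the general idea: one parameter-free STFT per decoder block, with the hop size chosen so that the number of STFT frames equals that block's temporal resolution. The abstract does not give these details, so the hop sizes, the window-to-hop ratio (`win_factor`), the use of magnitudes only, and the function names `stft_magnitude` and `multi_resolution_skips` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def stft_magnitude(wave, hop, win_factor=4):
    """Magnitude STFT with exactly len(wave) // hop frames.

    Hypothetical helper: the FFT size is assumed to be win_factor * hop.
    """
    n_fft = hop * win_factor
    pad = n_fft // 2
    # Reflect-pad so the frame grid spans the whole signal.
    padded = np.pad(wave, (pad, pad), mode="reflect")
    n_frames = len(wave) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Shape: (n_fft // 2 + 1 frequency bins, n_frames time steps).
    return np.abs(np.fft.rfft(frames, axis=1)).T


def multi_resolution_skips(wave, hops):
    """Return one STFT feature map per (assumed) decoder block.

    Each map stands in for the encoder output that would normally
    feed the skip connection of the matching decoder block.
    """
    return [stft_magnitude(wave, hop) for hop in hops]


if __name__ == "__main__":
    sample_rate = 22050  # assumed sample rate
    t = np.arange(8192) / sample_rate
    wave = np.sin(2 * np.pi * 440.0 * t)

    # Assumed decoder resolutions: coarsest block first, finest last.
    for hop, skip in zip((256, 64, 16, 4),
                         multi_resolution_skips(wave, (256, 64, 16, 4))):
        print(f"hop={hop:4d} -> skip feature shape {skip.shape}")
```

Because the STFTs have no learnable weights, swapping them in for the encoder blocks removes the encoder's parameters and its forward-pass cost, which is consistent with the reduction in parameters and generation time reported above. In a real model each feature map would presumably pass through a learned projection before entering the decoder; that step is omitted here.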