Recent advancements in speech synthesis have leveraged GAN-based networks such as HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating the inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce HiFTNet, an extension of iSTFTNet that incorporates a harmonic-plus-noise source filter in the time-frequency domain. The filter is driven by a sinusoidal source generated from the fundamental frequency (F0), which is inferred by a pre-trained F0 estimation network to keep inference fast. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. On LibriTTS, HiFTNet also outperforms BigVGAN-base for unseen speakers and performs comparably to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
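To make the idea of a sinusoidal source derived from F0 concrete, the following is a minimal sketch of a harmonic-plus-noise excitation signal in the time domain. It is an illustration under stated assumptions, not the HiFTNet implementation: the function name `harmonic_plus_noise_source`, the number of harmonics, the amplitudes, the hop length, and the repetition-based upsampling are all hypothetical choices made for this example. The abstract describes the source filter as operating in the time-frequency domain, so a signal like this would still need to be transformed (for example with an STFT) before being used that way.

```python
import numpy as np


def harmonic_plus_noise_source(f0_frames, hop_length=256, sample_rate=22050,
                               num_harmonics=8, sine_amp=0.1, noise_std=0.003,
                               seed=0):
    """Build a sample-level excitation from frame-level F0 (hypothetical sketch).

    Voiced frames (F0 > 0) get a sum of harmonics plus a little noise;
    unvoiced frames (F0 == 0) get noise only. All defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    f0_frames = np.asarray(f0_frames, dtype=np.float64)

    # Upsample frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop_length)
    voiced = f0 > 0

    # Instantaneous phase of the fundamental: running integral of frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)

    # Sum harmonics k * F0, dropping any that would exceed the Nyquist limit.
    harmonics = np.zeros_like(f0)
    for k in range(1, num_harmonics + 1):
        valid = voiced & (k * f0 < sample_rate / 2.0)
        harmonics += np.where(valid, np.sin(k * phase), 0.0)

    noise = rng.standard_normal(f0.shape)

    # Voiced: sinusoids plus low-level noise. Unvoiced: noise only.
    return np.where(voiced,
                    sine_amp * harmonics + noise_std * noise,
                    (sine_amp / 3.0) * noise)


if __name__ == "__main__":
    # Six frames: two unvoiced, three voiced at 220 Hz, one unvoiced.
    source = harmonic_plus_noise_source([0.0, 0.0, 220.0, 220.0, 220.0, 0.0])
    print(source.shape)
```

In this sketch the phase is accumulated from F0 rather than computed per frame, which keeps the sinusoid continuous across frame boundaries when F0 changes. The F0 values here are supplied by hand; in the setting the abstract describes, they would come from the pre-trained F0 estimation network.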