Denoising diffusion models have recently demonstrated remarkable performance among generative models in various domains. In the speech domain, however, applying diffusion models to synthesize time-varying audio is limited by complexity and controllability, because speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that encodes audio into a compressed audio codec representation and reconstructs high-fidelity audio from the low-dimensional compressed latent vector. We then use the latent diffusion models to sample a latent representation from a musical score. In addition, we extend the proposed model to an unsupervised singing voice learning framework, HiddenSinger-U, which can be trained on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices for speakers seen only in unlabeled training data.