Any-to-any singing voice conversion (SVC) is an audio editing technique that aims to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, timbre leakage is inevitable during conversion: the converted singing voice still sounds like the original singer. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC), which performs SVC in the latent space using an LDM. We pretrain a variational autoencoder using the well-known open-source So-VITS-SVC project, which is based on the VITS framework, and then use it for LDM training. In addition, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show that the proposed method outperforms previous works in both subjective and objective evaluations of timbre similarity.
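As a rough illustration of the classifier-free guidance idea that the singer guidance method builds on, the sketch below combines a conditional and an unconditional noise prediction at sampling time. It shows only the standard guidance rule, not the paper's exact formulation: the function name `guided_noise`, the plain-list representation of noise predictions, and the `scale` parameter are all assumptions made for this example.

```python
from typing import List, Sequence


def guided_noise(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard classifier-free guidance combination (illustrative only).

    eps_cond:   noise predicted with the target-singer condition (assumed input)
    eps_uncond: noise predicted with the condition dropped (assumed input)
    scale:      guidance strength; 0 gives the unconditional prediction,
                1 gives the conditional one, and larger values extrapolate
                further toward the condition
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction toward the conditional one.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    for s in (0.0, 1.0, 2.0):
        print(s, guided_noise(cond, uncond, s))
```

In a scheme of this kind, the model is usually trained with the condition randomly dropped so that a single network can produce both predictions. A scale above 1 pushes the sample more strongly toward the target singer, which is the intuition behind using guidance to suppress the original singer's timbre.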