We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on factorizing speech into explicitly disentangled representations that separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
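The contrast between heuristically perturbed inputs and self-synthesized examples can be illustrated with a small schematic. This is only a sketch under stated assumptions, not the authors' implementation: the warm-up schedule (`warmup_steps`), the function names, the random-gain perturbation, and the `toy_convert` placeholder are all hypothetical and do not come from the abstract. The one idea taken from the text is that the model's own converted output can serve as the training input, with the original audio as the reconstruction target.

```python
import random
from typing import Callable, List, Sequence, Tuple

Audio = List[float]  # toy stand-in for a waveform


def heuristic_perturb(audio: Audio, rng: random.Random) -> Audio:
    """Toy heuristic perturbation (assumption): scale by a random gain."""
    gain = rng.uniform(0.8, 1.2)
    return [gain * x for x in audio]


def make_training_pair(
    audio: Audio,
    other_speakers: Sequence[str],
    convert: Callable[[Audio, str], Audio],
    step: int,
    warmup_steps: int,
    rng: random.Random,
) -> Tuple[Audio, Audio]:
    """Return (model_input, reconstruction_target) for one training step.

    Before `warmup_steps` (a hypothetical schedule) the input is a
    heuristically perturbed copy of the audio. Afterwards the input is a
    self-synthesized example: the audio converted to another speaker by
    the current model, passed in here as `convert`.
    """
    if step < warmup_steps or not other_speakers:
        model_input = heuristic_perturb(audio, rng)
    else:
        target_speaker = rng.choice(list(other_speakers))
        model_input = convert(audio, target_speaker)
    # The target is always the original, unmodified audio.
    return model_input, list(audio)


def toy_convert(audio: Audio, speaker: str) -> Audio:
    """Hypothetical placeholder for the current model's voice conversion."""
    offset = 0.01 * len(speaker)
    return [x + offset for x in audio]
```

In a real system `convert` would run the synthesis model being trained, so the inputs would change as the model improves; the toy function here only makes the control flow runnable.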