In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model that decomposes a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis, enabling zero-shot any-to-any voice conversion that achieves state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10 seconds of data from a target speaker, our framework can perform voice swapping and achieves a speaker verification equal error rate (EER) of 5.5% for seen speakers and 8.4% for unseen speakers.
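The Siamese training idea described above, pulling together the content representations of an utterance and its pitch-shifted copy, can be illustrated with a small sketch. The abstract does not specify the loss, so the cosine-based objective below is an assumption rather than the authors' formulation. The function names `cosine` and `content_similarity_loss` are hypothetical, and the sketch assumes that pitch shifting preserves duration, so the two sequences of frame-level content embeddings are time-aligned.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)  # eps guards against all-zero frames


def content_similarity_loss(z_orig, z_shifted):
    """Mean (1 - cosine similarity) over time-aligned content frames.

    z_orig, z_shifted: lists of frame embeddings (one list of floats per
    frame) from the same content encoder applied to the original and the
    pitch-shifted audio. Assumed loss; not taken from the paper.
    """
    if len(z_orig) != len(z_shifted) or not z_orig:
        raise ValueError("expected non-empty, time-aligned embedding sequences")
    per_frame = [1.0 - cosine(u, v) for u, v in zip(z_orig, z_shifted)]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Toy embeddings: the second sequence differs only in per-frame scale,
    # so the loss is (numerically) zero.
    original = [[1.0, 0.0], [0.0, 2.0]]
    shifted = [[2.0, 0.0], [0.0, 1.0]]
    print(content_similarity_loss(original, shifted))
```

Minimizing such a loss rewards a content encoder whose output does not change when pitch is altered, which is one way to discourage speaker-dependent pitch cues from leaking into the content representation.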