Recognition-synthesis-based methods have become popular for voice conversion (VC). By introducing well-disentangled linguistic features extracted from an automatic speech recognition (ASR) model, VC performance has achieved considerable breakthroughs. Recently, self-supervised learning (SSL) methods trained on large-scale unannotated speech corpora have been applied to downstream tasks that focus on content information, which makes them suitable for VC. However, the large amount of speaker information retained in SSL representations significantly degrades both the timbre similarity and the quality of the converted speech. To address this problem, we propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. We incorporate adversarial training mechanisms into the synthesis module using external unannotated corpora. Two auxiliary discriminators are trained: one distinguishes whether a sequence of mel-spectrograms has been converted by the acoustic model, and the other distinguishes whether a sequence of content embeddings contains speaker information from the external corpora. Experimental results show that the proposed method achieves comparable similarity and higher naturalness than the supervised method, which requires a large amount of annotated corpora for training. The method can also be applied to improve similarity for VC methods that take other SSL representations as input.
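The two-discriminator setup described above can be illustrated with a generic adversarial objective. The abstract does not state the actual loss formulation, so the following is only a minimal sketch under assumptions: a standard binary cross-entropy GAN loss, hypothetical function names (`discriminator_loss`, `adversarial_loss`, `synthesis_loss`), and hypothetical weights `w_mel` and `w_content`. The discriminator outputs are represented as plain probabilities rather than network outputs.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-7  # clip to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(probs_reference, probs_other):
    """Assumed discriminator objective: output 1 on the reference class
    (e.g. natural target mel-spectrograms, or target-speaker content
    embeddings) and 0 on the other class (e.g. converted mel-spectrograms,
    or content embeddings from the external corpora)."""
    loss_reference = mean([bce(p, 1.0) for p in probs_reference])
    loss_other = mean([bce(p, 0.0) for p in probs_other])
    return loss_reference + loss_other

def adversarial_loss(probs_other):
    """Assumed objective for the synthesis side: make the discriminator
    assign the reference label (1) to the 'other' class."""
    return mean([bce(p, 1.0) for p in probs_other])

def synthesis_loss(reconstruction, mel_probs, content_probs,
                   w_mel=1.0, w_content=1.0):
    """Hypothetical combined loss: a reconstruction term plus one
    adversarial term per auxiliary discriminator."""
    return (reconstruction
            + w_mel * adversarial_loss(mel_probs)
            + w_content * adversarial_loss(content_probs))

if __name__ == "__main__":
    # Toy discriminator outputs (probabilities), purely illustrative.
    mel_d_loss = discriminator_loss([0.9, 0.8], [0.2, 0.1])
    content_d_loss = discriminator_loss([0.7, 0.6], [0.4, 0.3])
    total = synthesis_loss(0.5, [0.2, 0.1], [0.4, 0.3])
    print(mel_d_loss, content_d_loss, total)
```

In this sketch the synthesis-side loss falls as the discriminators become less able to separate the two classes, which is the intuition behind using adversarial training to suppress residual speaker information. The weighting and the choice of cross-entropy are illustrative and are not taken from the paper.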