Conveying the linguistic content while maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource scenario, where only a limited number of utterances from the target speaker are available, existing VC methods struggle to meet this requirement while also capturing the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. In addition, a simulation mode is introduced to simulate the inference process and thereby alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.