Voice conversion (VC), the style transfer task applied to speech, converts one person's speech into new speech that sounds like another person's. Much research has been devoted to improving VC. However, a good voice conversion model should match not only the timbre of the target speaker but also expressive information such as prosody, pace, and pauses. Prosody modeling is therefore crucial for expressive voice conversion that sounds natural and convincing, yet it remains challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information in speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. Experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.
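The abstract does not say how the mask-and-predict mechanism is implemented, so the following is only a generic sketch of the idea, not PMVC's actual method. It assumes frame-level features stored as plain lists of floats, hides random contiguous spans of frames, and scores a prediction only on the hidden frames. The function names (`mask_spans`, `apply_mask`, `masked_mse`) and the parameters `mask_ratio`, `span_len`, and `fill` are hypothetical and chosen for illustration.

```python
import random
from typing import List


def mask_spans(num_frames: int, mask_ratio: float = 0.25,
               span_len: int = 4, seed: int = 0) -> List[bool]:
    """Return a per-frame mask; True marks a frame to hide from the model."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    # Clamp the target so the loop below always terminates.
    target = min(num_frames, int(num_frames * mask_ratio))
    while sum(mask) < target:
        # Pick a random span start and hide span_len consecutive frames.
        start = rng.randrange(0, max(1, num_frames - span_len + 1))
        for i in range(start, min(start + span_len, num_frames)):
            mask[i] = True
    return mask


def apply_mask(frames: List[List[float]], mask: List[bool],
               fill: float = 0.0) -> List[List[float]]:
    """Replace hidden frames with a constant fill value (an assumption)."""
    return [[fill] * len(f) if m else list(f) for f, m in zip(frames, mask)]


def masked_mse(pred: List[List[float]], target: List[List[float]],
               mask: List[bool]) -> float:
    """Mean squared error computed over the hidden frames only."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy 20-frame sequence with 2 features per frame.
    frames = [[float(i), float(i) + 0.5] for i in range(20)]
    mask = mask_spans(len(frames), mask_ratio=0.25, span_len=4, seed=1)
    masked = apply_mask(frames, mask)
    # A real system would feed `masked` to a predictor; here the masked
    # input itself stands in as a trivial (poor) prediction.
    print(sum(mask), masked_mse(masked, frames, mask))
```

In a setup like this, a predictor sees only the unmasked frames and is trained to reconstruct the hidden ones. Restricting the loss to hidden frames keeps the model from being rewarded for simply copying its input.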