Recent developments in neural speech synthesis and vocoding have sparked renewed interest in voice conversion (VC). Beyond timbre transfer, controllability over para-linguistic parameters such as pitch and speed is critical for deploying VC systems in many application scenarios. Existing studies, however, either provide only utterance-level global control or offer controls that lack interpretability. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying control over pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech by a vocoder. Speed control is achieved through TD-PSOLA pre-processing of the source utterance, and pitch control by manipulating the pitch contour before it is fed to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two self-constructed baselines in speech quality and successfully achieves time-varying pitch and speed control.
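
To make the idea of manipulating the pitch contour concrete, the following minimal sketch applies a per-frame shift to an F0 contour before it would be passed to a pitch encoder. It is an illustration under stated assumptions, not the ControlVC implementation: the function name `shift_pitch_contour`, the semitone parameterization of the control curve, and the convention that unvoiced frames are marked by an F0 of 0 are all hypothetical choices made here.

```python
import numpy as np


def shift_pitch_contour(f0_hz, shift_semitones):
    """Apply a time-varying pitch shift to an F0 contour.

    f0_hz: per-frame fundamental frequency in Hz (assumed: 0 marks unvoiced frames).
    shift_semitones: scalar or per-frame shift in semitones (hypothetical control curve).
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    # Broadcast so a single global shift and a per-frame curve are both accepted.
    shift = np.broadcast_to(np.asarray(shift_semitones, dtype=float), f0_hz.shape)
    # One semitone corresponds to a frequency ratio of 2^(1/12).
    ratio = 2.0 ** (shift / 12.0)
    # Leave unvoiced frames at 0 so voicing decisions are unchanged.
    return np.where(f0_hz > 0, f0_hz * ratio, 0.0)


# Toy contour: flat 200 Hz with one unvoiced frame in the middle.
f0 = np.array([200.0, 200.0, 0.0, 200.0, 200.0])
# Control curve that glides from no shift up to one octave (12 semitones).
curve = np.linspace(0.0, 12.0, num=len(f0))
shifted = shift_pitch_contour(f0, curve)
```

Because the control is expressed per frame, a constant curve reduces to the utterance-level global control that the abstract contrasts with, while a varying curve changes the pitch over time.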