Zero-shot voice conversion is an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end zero-shot voice conversion model that uses location-variable convolutions (LVCs) to jointly model the conversion and speech synthesis processes. LVC-VC takes carefully designed input features in which content and speaker information are disentangled, and it uses a neural vocoder-like architecture whose LVCs efficiently combine these features, performing voice conversion while directly synthesizing time-domain audio. Experiments show that our model achieves a particularly good balance between voice style transfer and speech intelligibility compared to several baselines.
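To make the central operation concrete, the following is a minimal single-channel sketch of a location-variable convolution, not the LVC-VC implementation. In an LVC, a different kernel is applied to each interval of the input sequence, and those kernels are predicted from conditioning features rather than learned as fixed weights. Everything named here is an assumption for illustration: `toy_kernel_predictor` is a hypothetical stand-in for a learned kernel-predictor network, and the scalar conditioning values, the kernel size of 3, and `hop` are arbitrary choices.

```python
from typing import List, Sequence


def toy_kernel_predictor(cond: Sequence[float]) -> List[List[float]]:
    """Hypothetical stand-in for a learned kernel predictor.

    Maps one scalar conditioning value per frame to a size-3 kernel that
    simply scales the centre sample. A real model would use a neural
    network over much richer conditioning features.
    """
    return [[0.0, float(c), 0.0] for c in cond]


def location_variable_conv(
    x: Sequence[float],
    kernels: Sequence[Sequence[float]],
    hop: int,
) -> List[float]:
    """Filter each hop-sized segment of x with its own kernel.

    Segment i (samples i*hop .. (i+1)*hop - 1) uses kernels[i]. The input
    is zero-padded so the output has the same length as x. Like most deep
    learning frameworks, this computes a cross-correlation.
    """
    if not kernels:
        raise ValueError("at least one kernel is required")
    if len(x) != hop * len(kernels):
        raise ValueError("len(x) must equal hop * len(kernels)")
    k = len(kernels[0])
    if k % 2 == 0 or any(len(kern) != k for kern in kernels):
        raise ValueError("all kernels must share the same odd size")

    pad = k // 2
    padded = [0.0] * pad + [float(v) for v in x] + [0.0] * pad

    out: List[float] = []
    for seg, kernel in enumerate(kernels):
        for t in range(seg * hop, (seg + 1) * hop):
            # padded[t + j] corresponds to x[t + j - pad]
            out.append(sum(kernel[j] * padded[t + j] for j in range(k)))
    return out


# Usage: two conditioning frames, each controlling two output samples.
signal = [1.0, 1.0, 1.0, 1.0]
kernels = toy_kernel_predictor([1.0, 2.0])
converted = location_variable_conv(signal, kernels, hop=2)
```

Because the kernels depend on the conditioning, the same input samples are filtered differently from one interval to the next. This is the property that lets conditioning information shape the waveform inside the synthesis network instead of in a separate stage.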