We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent variable model with contrastive learning. HybridVC supports both text and audio prompts, enabling more flexible voice style conversion. It models a latent distribution conditioned on speaker embeddings obtained from a pre-trained speaker encoder and, in parallel, optimises style text embeddings through contrastive learning so that they align with the speaker style information. This design allows HybridVC to be trained efficiently under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. These results underscore its potential for widespread applications, such as user-defined personalised voices on social media platforms. A comprehensive ablation study further validates the effectiveness of our method.
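As an illustration of the contrastive alignment between style text embeddings and speaker embeddings described in the abstract, the following is a minimal sketch only. The abstract does not state which contrastive objective HybridVC uses, so the symmetric InfoNCE loss, the temperature value, and the names `l2_normalize` and `info_nce` are all assumptions introduced here for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(text_embs, speaker_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption (not from the paper): text_embs[i] and speaker_embs[i]
    describe the same utterance, and every other pairing in the batch
    is treated as a negative.
    """
    assert len(text_embs) == len(speaker_embs)
    t = [l2_normalize(v) for v in text_embs]
    s = [l2_normalize(v) for v in speaker_embs]
    n = len(t)

    # logits[i][j]: temperature-scaled cosine similarity of text i and speaker j
    logits = [
        [sum(a * b for a, b in zip(t[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the text-to-speaker and speaker-to-text directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))
```

Under these assumptions, the loss is small when each text embedding is most similar to its own speaker embedding and large when the pairs are mismatched, which is the behaviour an alignment objective of this kind is meant to encourage.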