We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC accepts both text and audio prompts, enabling more flexible voice style conversion. It models a latent distribution conditioned on speaker embeddings obtained from a pre-trained speaker encoder and, in parallel, optimises style text embeddings through contrastive learning so that they align with the speaker style information. This design allows HybridVC to be trained efficiently under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion, underscoring its potential for widespread applications such as user-defined personalised voices on social media platforms. A comprehensive ablation study further validates the effectiveness of our method.
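The contrastive alignment between style text embeddings and speaker embeddings can be illustrated with a small sketch. The abstract does not specify the loss, so the code below assumes a symmetric InfoNCE objective of the kind commonly used for cross-modal alignment. The function names, the temperature of 0.07, the embedding sizes, and the random stand-in embeddings are all hypothetical and are not taken from HybridVC itself.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, speaker_emb, temperature=0.07):
    # Assumed objective (not confirmed by the source): row i of text_emb and
    # row i of speaker_emb form a positive pair; all other rows are negatives.
    t = l2_normalize(text_emb)
    s = l2_normalize(speaker_emb)
    logits = t @ s.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_t2s = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> speaker
    loss_s2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # speaker -> text
    return 0.5 * (loss_t2s + loss_s2t)

# Hypothetical stand-ins for encoder outputs: 4 utterances, 16-dim embeddings.
rng = np.random.default_rng(0)
speaker = rng.normal(size=(4, 16))
aligned_text = speaker + 0.01 * rng.normal(size=(4, 16))  # well-aligned pairs
shuffled_text = aligned_text[[1, 2, 3, 0]]                # mismatched pairs

loss_aligned = symmetric_contrastive_loss(aligned_text, speaker)
loss_shuffled = symmetric_contrastive_loss(shuffled_text, speaker)
```

Under this assumed objective, text embeddings that match their paired speaker embeddings yield a lower loss than mismatched ones, which is the signal that would pull the style text embeddings towards the speaker style space during training.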