Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in zero-shot environments, and is further complicated by the mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining features from various speakers; applying mixup at the latent level improves speaker similarity and naturalness in zero-shot VC. Experimental results demonstrate that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality. Our demo is available at \url{https://hayeong0.github.io/VoicePrompter-demo/}.
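The abstract does not specify the exact form of latent mixup; a minimal sketch, assuming the conventional mixup formulation (a Beta-distributed convex combination) applied to latent speaker representations rather than raw waveforms, could look as follows. The function name `latent_mixup` and the parameter `alpha` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def latent_mixup(z_a, z_b, alpha=1.0, rng=None):
    """Convex-combine two latent speaker representations.

    Hypothetical sketch: standard mixup applied in latent space.
    lam is drawn from Beta(alpha, alpha), so alpha controls how
    strongly the two speakers are blended.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))
    z_mix = lam * z_a + (1.0 - lam) * z_b
    return z_mix, lam

# Example: blend two 256-dim latent vectors from different speakers.
z_a = np.ones(256)
z_b = np.zeros(256)
z_mix, lam = latent_mixup(z_a, z_b)
```

In this toy example every entry of `z_mix` equals `lam`, since the inputs are constant one- and zero-vectors; with real speaker latents the result is an interpolated representation the decoder must still render naturally, which is the in-context-learning signal the paper describes.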