Zero-shot voice conversion (VC) aims to convert the timbre of a source speaker to that of an arbitrary unseen target speaker while keeping the linguistic content unchanged. Although the voice of the generated speech can be controlled by providing a speaker embedding of the target speaker, the speaker similarity of the converted speech still lags behind that of ground-truth recordings. In this paper, we propose SEF-VC, a speaker-embedding-free voice conversion model that learns and incorporates speaker timbre from reference speech via a position-agnostic cross-attention mechanism and then reconstructs the waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate that SEF-VC generates high-quality speech with better similarity to the target reference than strong zero-shot VC baselines, even for very short reference speech.
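
To make the term "position-agnostic" concrete, the following is a minimal sketch and not the SEF-VC implementation. It assumes that position-agnostic means no positional encoding is added to the reference sequence, so the attention output does not depend on the order of the reference frames. The function names, the toy dimensions, and the omission of learned query/key/value projections are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, reference):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries:   list of d-dim vectors, standing in for content frames.
    reference: list of d-dim vectors, standing in for reference-speech
               frames. They serve as both keys and values, and no
               positional encoding is added to them (assumption).
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every reference frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in reference]
        weights = softmax(scores)
        # Weighted average of reference frames.
        out = [sum(w * v[j] for w, v in zip(weights, reference))
               for j in range(d)]
        outputs.append(out)
    return outputs


random.seed(0)
d = 4  # toy feature dimension (hypothetical)
content = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]
reference = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

# Shuffle the reference frames: with no positional information on the
# keys/values, the attention output should be unchanged.
shuffled = reference[:]
random.shuffle(shuffled)

out_a = cross_attention(content, reference)
out_b = cross_attention(content, shuffled)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_a, out_b)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-9)  # True: output is invariant to reference order
```

Under this assumption, each output frame is a weighted average over reference frames in which the weights depend only on content similarity, so reordering the reference changes nothing beyond floating-point rounding.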