We present a voice conversion (VC) framework that uses k-nearest neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides the ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To keep the target-speaker identity consistent, we incorporate a speaker loss derived from a pretrained speaker verification model. Although the model is trained exclusively on English data, experiments across multiple languages show that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines. Audio samples are available at https://palindromic-vc.github.io.
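
The retrieval and speaker-loss ideas described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not specify: frame-level features are already extracted (random arrays stand in for WavLM outputs here), similarity is cosine, the k retrieved frames are averaged, target frames act as queries against a pool of frames from another speaker, and the speaker loss is one minus the cosine similarity between two embeddings. The function names `knn_synthetic_input` and `speaker_loss` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def knn_synthetic_input(target_feats, source_pool, k=4):
    """Build a synthetic input sequence for one target utterance.

    target_feats: (T, D) frame features of the real target utterance (queries).
    source_pool:  (N, D) frame features pooled from another speaker, N >= k.
    Returns a (T, D) array: each target frame is replaced by the mean of its
    k most cosine-similar frames in the pool (averaging is an assumption).
    """
    sims = l2_normalize(target_feats) @ l2_normalize(source_pool).T  # (T, N)
    # The first k columns after partitioning hold the k largest similarities.
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]  # (T, k)
    return source_pool[topk].mean(axis=1)  # (T, D)


def speaker_loss(emb_pred, emb_target):
    """Cosine-distance loss between two speaker embeddings (assumed form)."""
    return 1.0 - float(l2_normalize(emb_pred) @ l2_normalize(emb_target))


# Placeholder data: random features instead of real WavLM frames.
rng = np.random.default_rng(0)
target_feats = rng.normal(size=(50, 16))   # real target utterance (output side)
source_pool = rng.normal(size=(500, 16))   # frames from a different speaker
synthetic = knn_synthetic_input(target_feats, source_pool, k=4)
# (synthetic, target_feats) would form one synthetic-to-real training pair.
```

In this sketch the pair `(synthetic, target_feats)` plays the role of a supervised example: the model would take the retrieved, other-speaker frames as input and be trained to reproduce the real target frames, with `speaker_loss` added on embeddings of the generated and reference audio.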