We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) that operates in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and to the independence assumption adopted in MKL, our method uses the barycentric projection of the discrete OT plan to construct a transport map between the source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech show that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of word error rate (WER), mean opinion score (MOS), and Fréchet audio distance (FAD). Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that a state-of-the-art spoofing detector misclassifies as bona fide. This result demonstrates the strong domain adaptation capability of OT in embedding space and reveals important security implications for spoof detection systems.
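To make the barycentric projection concrete, the following is a minimal sketch rather than the kDOT implementation. It assumes uniform weights on both embedding sets and a squared Euclidean cost, neither of which is specified in the abstract, and it solves the discrete OT problem as a small linear program with SciPy. The function names `discrete_ot_plan` and `barycentric_projection` are hypothetical. Each source embedding $x_i$ is mapped to $T(x_i) = \sum_j P_{ij}\, y_j / \sum_j P_{ij}$, where $P$ is the OT plan and $y_j$ are the target embeddings.

```python
import numpy as np
from scipy.optimize import linprog


def discrete_ot_plan(cost, a, b):
    """Solve min <P, cost> subject to P 1 = a, P^T 1 = b, P >= 0 as a linear program."""
    n, m = cost.shape
    # P is flattened row-major, so entry (i, j) sits at index i * m + j.
    # Row-sum constraints: for each i, sum_j P[i, j] = a[i].
    row_constraints = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: for each j, sum_i P[i, j] = b[j].
    col_constraints = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_constraints, col_constraints]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x.reshape(n, m)


def barycentric_projection(plan, targets):
    """Map each source point to the plan-weighted mean of the target points."""
    row_mass = plan.sum(axis=1, keepdims=True)
    return (plan @ targets) / row_mass


def convert(source, target):
    """Transport source embeddings toward the target set (illustrative assumptions only)."""
    n, m = len(source), len(target)
    a = np.full(n, 1.0 / n)  # assumed uniform source weights
    b = np.full(m, 1.0 / m)  # assumed uniform target weights
    # Assumed cost: squared Euclidean distance between embeddings.
    cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    plan = discrete_ot_plan(cost, a, b)
    return barycentric_projection(plan, target), plan
```

In a VC pipeline, `source` and `target` would be frame-level embeddings of shape `(frames, dim)` extracted by the pretrained speech model, and the converted embeddings would then be passed to a vocoder. The dense linear program is only practical for small sets; a dedicated OT solver would be needed at realistic utterance lengths.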