We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) operating in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and the independence assumption adopted in MKL, our method employs the barycentric projection of the discrete OT plan to construct a transport map between source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech demonstrate that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD. Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that are misclassified as bona fide by a state-of-the-art spoofing detector. This demonstrates the strong domain adaptation capability of OT in embedding space, while also revealing important security implications for spoof detection systems.
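The barycentric projection mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. The uniform weights on embeddings, the squared Euclidean cost, the linear-programming solver, and the function names `discrete_ot_plan` and `barycentric_projection` are all assumptions made for illustration; the abstract does not specify them. Given a plan $P$ with row marginals $a$, each source embedding $x_i$ is mapped to $\sum_j P_{ij} y_j / a_i$, a plan-weighted mean of target embeddings.

```python
import numpy as np
from scipy.optimize import linprog


def discrete_ot_plan(X, Y):
    """Exact OT plan between uniform empirical measures on the rows of X and Y.

    Assumptions (not from the source): uniform weights, squared Euclidean cost.
    Returns the plan P of shape (n, m) and the source weights a of shape (n,).
    """
    n, m = len(X), len(Y)
    a = np.full(n, 1.0 / n)  # source marginal
    b = np.full(m, 1.0 / m)  # target marginal

    # Pairwise squared Euclidean cost, shape (n, m).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

    # The plan is flattened row-major, so entry (i, j) sits at index i * m + j.
    A_rows = np.kron(np.eye(n), np.ones((1, m)))  # row sums must equal a
    A_cols = np.kron(np.ones((1, n)), np.eye(m))  # column sums must equal b
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, m), a


def barycentric_projection(P, a, Y):
    """Map each source point i to sum_j P[i, j] * Y[j] / a[i]."""
    return (P @ Y) / a[:, None]


# Toy stand-ins for source and target speaker embeddings (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 source frames, 3-dimensional features
Y = rng.normal(size=(6, 3))  # 6 target frames
P, a = discrete_ot_plan(X, Y)
converted = barycentric_projection(P, a, Y)  # shape (4, 3)
```

Because the numbers of source and target frames generally differ, a row of the plan can spread mass over several target embeddings, and the projection then returns their weighted mean rather than a single nearest neighbour. The linear program has `n * m` variables, so this sketch is only practical for small inputs; a dedicated OT solver would be the natural choice at realistic utterance lengths.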