Any-to-any voice conversion aims to transform source speech into a target voice using only a few utterances from the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity, which makes results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity while achieving intelligibility scores similar to those of existing methods. Code, samples, and trained models: https://bshall.github.io/knn-vc
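The frame-replacement step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function name `knn_convert` is hypothetical, cosine similarity is an assumed choice of distance, averaging over `k > 1` neighbors is an assumed generalization of the single nearest neighbor, and random arrays stand in for the self-supervised features. Feature extraction and vocoding are omitted.

```python
import numpy as np


def knn_convert(source, reference, k=1):
    """Replace each source frame with the mean of its k nearest reference frames.

    source:    array of shape (T_src, D), features of the source utterance
    reference: array of shape (T_ref, D), features pooled from the target speaker
    Hypothetical helper; cosine similarity and neighbor averaging are assumptions.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = src @ ref.T  # shape (T_src, T_ref)

    # Indices of the k most similar reference frames for every source frame.
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Gather the (unnormalized) reference frames and average over the k neighbors.
    return reference[nearest].mean(axis=1)


# Stand-in features: random arrays instead of real self-supervised representations.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 16))
reference = rng.normal(size=(200, 16))

converted = knn_convert(source, reference, k=1)
print(converted.shape)  # (50, 16)
```

With `k=1`, every output frame is an exact copy of a reference frame, so the converted sequence keeps the source's frame order while being built entirely from the target speaker's features. In a full pipeline, a vocoder would then synthesize audio from `converted`.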