Voice conversion aims to convert source speech into a target voice using recordings of the target speaker as a reference. Newer models are producing increasingly realistic output. But what happens when models are fed non-standard data, such as speech from a user with a speech impairment? We investigate how a recent voice conversion model performs on non-standard downstream voice conversion tasks. We use a simple but robust approach called k-nearest neighbors voice conversion (kNN-VC). We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion. The last of these involves converting to a target voice specified through a text description, e.g. "a young man with a high-pitched voice". Compared to an established baseline, we find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion. Results are more mixed for the musical instrument and text-to-voice conversion tasks. For example, kNN-VC works well on some instruments like drums but not on others. Nevertheless, this shows that voice conversion models, and kNN-VC in particular, are increasingly applicable in a range of non-standard downstream tasks. But there are still limitations when samples are very far from the training distribution. Code, samples, trained models: https://rf5.github.io/sacair2023-knnvc-demo/.
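The abstract names kNN-VC without describing its matching step. As a rough illustration only, the sketch below assumes the usual k-nearest-neighbors regression formulation: each source frame is replaced by the average of its k most similar frames from the target reference, using cosine similarity. The function name `knn_convert`, the default `k=4`, and the random arrays standing in for frame-level speech features are all assumptions made for this example; the feature extractor and the vocoder that a full system would need are omitted.

```python
import numpy as np


def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_src, dim) array of frame-level features (hypothetical input).
    target_feats: (n_tgt, dim) array of features from the target reference.
    Nearest neighbors are chosen by cosine similarity.
    """
    eps = 1e-8  # guards against division by zero for all-zero frames
    src_norm = np.maximum(np.linalg.norm(source_feats, axis=1, keepdims=True), eps)
    tgt_norm = np.maximum(np.linalg.norm(target_feats, axis=1, keepdims=True), eps)

    # Cosine similarity between every source frame and every target frame.
    sim = (source_feats / src_norm) @ (target_feats / tgt_norm).T  # (n_src, n_tgt)

    # Cannot pick more neighbors than there are target frames.
    k = min(k, target_feats.shape[0])

    # Indices of the k most similar target frames per source frame (unordered).
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (n_src, k)

    # Average the selected target frames to form the converted features.
    return target_feats[nn_idx].mean(axis=1)  # (n_src, dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(50, 16))   # stand-in for source utterance features
    target = rng.normal(size=(200, 16))  # stand-in for target reference features
    converted = knn_convert(source, target, k=4)
    print(converted.shape)
```

Because every output frame is built only from target frames, the converted features stay within the target's feature space, which is one plausible reading of why such an approach could transfer to inputs like stuttered or cross-lingual speech.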